Methods and Compositions for Large-Scale Analysis of Nucleic Acids Using DNA Deletions

ABSTRACT

The present invention is related generally to analysis of polynucleotides, particularly polynucleotides derived from genomic DNA. The invention provides methods, compositions and systems for such analysis. Encompassed by the invention are constructs that include pairs of target sequences which are separated by a known distance in the polynucleotide from which they are derived.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to provisional application Ser. No. 60/864,992, filed Nov. 9, 2006, which is hereby incorporated by reference in its entirety.

BACKGROUND OF THE INVENTION

Large-scale sequence analysis of genomic DNA is central to understanding a wide range of biological phenomena related to states of health and disease both in humans and in many economically important plants and animals, e.g., Collins et al (2003), Nature, 422: 835-847; Service, Science, 311: 1544-1546 (2006); Hirschhorn et al (2005), Nature Reviews Genetics, 6: 95-108; National Cancer Institute, Report of Working Group on Biomedical Technology, “Recommendation for a Human Cancer Genome Project,” (February, 2005); Tringe et al (2005), Nature Reviews Genetics, 6: 805-814. The need for low-cost high-throughput sequencing and re-sequencing has led to the development of several new approaches that employ parallel analysis of many target DNA fragments simultaneously, e.g., Drmanac et al., Scientia Yugoslavica, 16(1-2): 97-107 (1990); Margulies et al, Nature, 437: 376-380 (2005); Shendure et al (2005), Science, 309: 1728-1732; Metzker (2005), Genome Research, 15: 1767-1776; Shendure et al (2004), Nature Reviews Genetics, 5: 335-344; Lapidus et al, U.S. patent publication US 2006/0024711; Drmanac et al, U.S. patent publication US 2005/0191656; Brenner et al, Nature Biotechnology, 18: 630-634 (2000); and the like.

Such approaches reflect a variety of solutions for increasing target polynucleotide density in planar arrays and for obtaining increasing amounts of sequence information from each application of a sequence detection reaction.

Most traditional methods of sequence analysis are restricted to determining a few tens of nucleotides before signals become significantly degraded, thus placing a significant limit on overall sequencing efficiency. Such short sequence reads are particularly problematic in regions of a target sequence which contain long strings of repeating nucleotides or tandem repeats.

In view of such limitations, it would be advantageous for the field if methods and tools could be designed to increase the efficiency of sequencing reactions as well as the efficiency of assembling complete sequences from shorter read lengths.

SUMMARY OF THE INVENTION

In one aspect, the invention provides a method for forming a polynucleotide that includes a deletion mate pair. This method includes the step of providing a first linear construct, which includes a first adaptor interposed between a first target polynucleotide fragment and a second target polynucleotide fragment. The first target polynucleotide fragment and the second target polynucleotide fragment are contiguous nucleic acids within a target polynucleotide. In a further step, a deletion adaptor is ligated to the first linear construct to form a second linear construct. This deletion adaptor includes a recognition site for a restriction endonuclease, and the restriction endonuclease in the deletion adaptor cleaves at a known distance from its recognition site. The restriction endonuclease is applied to cleave the second linear construct to form a third linear construct, thus forming the polynucleotide that includes a deletion mate pair.

In another aspect, the invention provides a method for forming a circular polynucleotide that includes a deletion mate pair. This method includes the step of providing a first circular construct. The first circular construct includes a first adaptor and a target polynucleotide. The first adaptor includes a recognition site for a first restriction endonuclease that cleaves at a known distance from the recognition site and a recognition site for a second restriction endonuclease that cleaves within the first adaptor. The first restriction endonuclease is used to cleave the first circular construct to form a first linear construct. The first linear construct is in turn cleaved with the second restriction endonuclease to form a second linear construct. The second linear construct is then circularized to create a second circular construct, thus forming the circular polynucleotide that includes a deletion mate pair.

In yet another aspect, the invention provides a method for forming a polynucleotide that includes a deletion mate pair. This method includes the step of providing a first linear construct. This first linear construct includes a target polynucleotide and an adaptor, and in addition, a first adaptor is attached to one end of the polynucleotide. A deletion adaptor is ligated to the end of the first linear construct opposite the first adaptor, and the deletion adaptor includes a recognition site for a restriction endonuclease that cleaves at a known distance from the recognition site. The restriction endonuclease is applied to cleave the first linear construct to form a second linear construct, thus forming the polynucleotide that includes a deletion mate pair.

In still another aspect, the invention provides a method for forming a polynucleotide that includes a deletion mate pair. This method includes the step of providing a first linear construct which includes a target polynucleotide. A deletion adaptor is ligated to one end of the first linear construct, and the deletion adaptor includes a recognition site for a restriction endonuclease that cleaves at a known distance from the recognition site. The first linear construct is cleaved with the restriction endonuclease to form a second linear construct, thus forming the polynucleotide that includes a deletion mate pair.

In another aspect, the invention provides a method for forming a polynucleotide that includes a deletion mate pair. This method includes the step of providing a first linear construct that includes a target polynucleotide. A deletion adaptor is ligated to one end of the linear construct, and this deletion adaptor comprises a recognition site for a restriction endonuclease that cleaves at a known distance from the recognition site. The first linear construct is cleaved with the restriction endonuclease to form a second linear construct, thus forming the polynucleotide that includes a deletion mate pair.

In still another aspect, the invention provides a method for forming a polynucleotide that includes a deletion mate pair. The method includes the step of providing a first circular construct. The first circular construct includes a first adaptor and a target polynucleotide, and the first adaptor includes a recognition site for a first restriction endonuclease that cleaves at a known distance from the recognition site. The first circular construct is then cleaved with the first restriction endonuclease to form a first linear construct. In a further step, a second adaptor is provided, and the second adaptor includes a recognition site for a second restriction endonuclease that cleaves at a known distance from the recognition site. The second adaptor is ligated to one end of the first linear construct to create a second linear construct. The second linear construct is then circularized to form a second circular construct, thus forming the polynucleotide that includes a deletion mate pair.
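The cleavage step shared by the foregoing methods, in which an endonuclease bound at a recognition site in a deletion adaptor cuts at a known distance into the adjoining target, can be illustrated with a minimal single-strand simulation. The adaptor sequence, the recognition site, and the 20-base reach below are placeholder assumptions chosen only for illustration; they are not taken from this specification, and staggered ends left by real enzymes are ignored.

```python
def cleave_at_known_distance(construct: str, site: str, reach: int):
    """Split `construct` at `reach` bases downstream of the first `site`.

    One strand only; overhangs produced by real enzymes are not modeled.
    Returns (released_piece, retained_piece).
    """
    site_start = construct.find(site)
    if site_start == -1:
        raise ValueError("recognition site not found")
    cut = site_start + len(site) + reach
    if cut > len(construct):
        raise ValueError("cut point falls beyond the end of the construct")
    return construct[:cut], construct[cut:]


# Hypothetical values: an adaptor that ends with the placeholder site,
# and an enzyme assumed to cut 20 bases past that site.
DELETION_ADAPTOR = "GCATCTCCGAC"
SITE = "TCCGAC"
REACH = 20

target = "TTGACCAGTTCAGGCATTCA" + "CGTACGTTAG"
construct = DELETION_ADAPTOR + target

released, retained = cleave_at_known_distance(construct, SITE, REACH)
deleted_bases = released[len(DELETION_ADAPTOR):]  # target bases removed with the adaptor
```

In this sketch the retained piece is the target shortened by exactly `REACH` bases, which is the property that makes the size of the deletion known in advance.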

In one aspect of the invention, precise mate pair deletion constructs comprise a deletion of a specific length (e.g., about 10-100 or more bases) or a series of deletions of known length multiples, e.g., a set of constructs comprising constructs with a known 10 nt deletion, constructs with a known 20 nt deletion, and constructs with a known 30 nt deletion. Such precise mate pair deletion constructs can be used to extend read lengths by cleaving circularized target nucleotides, deleting a known number of bases at the cleavage site, identifying bases on each side of the deletion, and analyzing the combined data of the precise mate pair constructs to form an indirectly extended read length comprising both directly determined and deleted bases.
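As a rough illustration of the series of deletions described above, the following sketch derives the two arms of a deletion mate pair from a known sequence for deletions of 10, 20, and 30 nt and computes the span each pair covers. The function names, the example sequence, and the 5-base read length are assumptions made for the illustration only.

```python
def deletion_mate_pair(target: str, start: int, read_len: int, deletion_len: int):
    """Return (left_arm, right_arm) flanking a deletion of known length.

    `start` is the 0-based index of the first base of the left arm.
    """
    left_end = start + read_len
    right_start = left_end + deletion_len
    right_end = right_start + read_len
    if start < 0 or right_end > len(target):
        raise ValueError("mate pair does not fit inside the target")
    return target[start:left_end], target[right_start:right_end]


def combined_span(read_len: int, deletion_len: int) -> int:
    """Bases covered by one pair: two directly read arms plus the known gap."""
    return 2 * read_len + deletion_len


# Illustrative 46-base sequence and read length (both hypothetical).
target = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGGCATGCAATTCCGG"
READ_LEN = 5

pairs = {d: deletion_mate_pair(target, 0, READ_LEN, d) for d in (10, 20, 30)}
spans = {d: combined_span(READ_LEN, d) for d in pairs}
```

With 5-base arms, the 30 nt deletion pair covers 40 bases of the target while only 10 bases are read directly; the bases in the gap would have to be supplied by other, overlapping constructs.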

In another aspect of the invention, sequencing reactions using precise deletion mate pair constructs and conventional mate pair constructs are utilized. Preferably, the sequencing reads of the combined nucleotides will span the length of the known deletion in any of the deletion mate pair constructs.

In one aspect of the invention, a library of constructs is prepared, wherein the library comprises staggered restriction fragments, with each fragment comprising a defined deletion on one or both sides of the fragment. Sequencing reads from these libraries provide longer combined read lengths than the use of the fragments alone. These library constructs may comprise precise deletion mate pairs and/or traditional mate pairs.

In one aspect, the invention provides a method for analyzing a polynucleotide sequence. This method includes providing a deletion mate pair construct. In a preferred aspect, the deletion mate pair construct includes the following: (i) a first adaptor, (ii) a second adaptor, (iii) a first target sequence, and (iv) a second target sequence. The first target sequence and the second target sequence span a portion of the polynucleotide sequence. The method includes the step of identifying at least one nucleotide of the first target sequence and at least one nucleotide of the second target sequence, thereby analyzing the polynucleotide sequence.

In one aspect, the invention provides a method for forming a library of a plurality of circularized deletion mate pair constructs. This method includes ligating a deletion adaptor to each of a plurality of first linear constructs. The deletion adaptor includes a recognition site for a restriction endonuclease that cleaves at a known distance from the recognition site. At least a portion of the plurality of first linear constructs is cleaved with the restriction endonuclease to provide a plurality of second linear constructs. At least a portion of the plurality of the second linear constructs is circularized, thus forming the library of circularized deletion mate pair constructs.

In one aspect, the invention provides a method for forming a random array. In this method, a support with a surface is provided, as is a plurality of deletion mate pair constructs. The plurality of deletion mate pair constructs is immobilized on the surface, thereby forming the random array. In a further aspect the invention provides random arrays made according to this method.

In another aspect, the invention provides a library that includes a plurality of deletion mate pair constructs. The plurality of deletion mate pair constructs includes target sequences, and the target sequences together represent at least about 80% of a genome.

In another aspect, the invention provides a library that includes a plurality of circularized deletion mate pair constructs. The plurality of deletion mate pair constructs includes target sequences, and the target sequences represent at least about 80% of a genome. In a preferred aspect, each of the plurality of circularized deletion mate pair constructs includes a first adaptor, a first target sequence, and a second target sequence. In this aspect, the first target sequence and the second target sequence are separated by a known number of bases within the genome.

In one aspect, the invention provides a substrate that includes a plurality of immobilized concatemers. In this aspect, each unit of the concatemer includes a deletion mate pair construct, and the deletion mate pair construct includes a first target sequence and a second target sequence. In addition, the first target sequence and the second target sequence are derived from a target polynucleotide; the first target sequence and the second target sequence are separated by a known distance within the target polynucleotide.

In one aspect, the invention provides an amplicon made by amplification of a circular library construct. The circular library construct includes target nucleic acid interspersed with a plurality of adaptors, and at least two sets of the adaptors are positioned on either side of a target polynucleotide of known length.

In another aspect, the invention provides a plurality of amplicons of circular library constructs. Each amplicon includes target nucleic acid interspersed with a plurality of adaptors, and at least two sets of the adaptors are positioned on either side of a target polynucleotide of known length.

In one aspect, the invention provides a kit for selecting for desired orientations of multiple adaptors in library constructs. The kit includes the following elements: (a) a first double-stranded adaptor, which includes a recognition site for a first Type IIs restriction endonuclease; (b) a second double-stranded adaptor, which includes a restriction site for a second Type IIs restriction endonuclease; and (c) primers complementary to both ends of each of the first and second double-stranded adaptors.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a method of circularization using an adaptor.

FIG. 2 illustrates the general concept of precise deletion mate pair formation.

FIG. 3 illustrates the use of mate pair deletions for determining repeats within a target nucleotide.

FIG. 4 illustrates a general method for creating deletion mate pair constructs.

FIG. 5 illustrates one method for creating a circularized deletion mate pair construct.

FIG. 6 illustrates one method using multiple deletion cycles to form the circular deletion mate pair construct.

FIG. 7 illustrates another method to form multiple deletion mate pair constructs.

FIG. 8 illustrates another method to form multiple circular deletion mate pair constructs.

FIG. 9 illustrates yet another method to form multiple deletion mate pair constructs.

FIG. 10 illustrates one method using deletion cycles to form the circular deletion mate pair construct.

FIG. 11 illustrates the use of an adaptor with two exact cutting sites in a method to form circular deletion mate pair constructs.

FIG. 12 illustrates another method to form multiple circular deletion mate pair constructs.

FIG. 13 illustrates a method of creating a construct comprising shorter target sequences.

FIGS. 14 and 15 illustrate one aspect of the embodiments for creating concatemers for use in the invention.

FIG. 16 illustrates a method for creating an array comprising amplicons of the invention.

FIG. 17 illustrates a top view of placement of concatemers onto discrete regions on an array surface.

FIG. 18 illustrates the placement of concatemers in arrays with distinct regions for attachment.

DETAILED DESCRIPTION OF THE INVENTION

The practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, New York, Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London, Nelson and Cox (2000), Lehninger, Principles of Biochemistry 3rd Ed., W.H. Freeman Pub., New York, N.Y. and Berg et al. (2002) Biochemistry, 5th Ed., W.H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All patents, patent publications and other publications mentioned herein are incorporated herein by reference for the purpose of describing and disclosing devices, formulations and methodologies which are described in the publication and which might be used in connection with the presently described invention.

Where a range of values is provided, it is understood that each intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, features and procedures well known to those skilled in the art have not been described in order to avoid obscuring the invention.

Overview

The invention provides methods and compositions for producing deletion mate pairs and deletion mate pair constructs. Deletion mate pairs are generally two target sequences which are separated by a known distance within the polynucleotide from which they are derived. Deletion mate pair constructs are polynucleotide molecules which include at least one deletion mate pair.

The use of deletion mate pair constructs, either with or without the use of conventional mate pairs, allows for the indirect sequencing of longer sequences than can be obtained using only conventional mate pairs. Deletion mate pairs thereby provide sequence information that is generally available only by performing longer sequence reads. Thus, the present invention provides advantages similar to those of conventional techniques that read every nucleotide, without the need to obtain longer reads, which are more expensive and more difficult to generate, especially in high-throughput, high-density DNA arrays. Effectively, the read length that can be obtained using these overlapping fragments with precise deletions allows determination of the deleted sequence region without direct identification of such sequences. This obtained information is useful in polynucleotide analysis, and can be used, e.g., for determining length of tandem repeats or for unique mapping and assembly of long and/or dispersed nucleotide repeats. In a specific example, the number of deleted bases may be, e.g., between 2 and 10 times the number of bases identified in a specific sequencing reaction, thus effectively extending the read length of such methods 2-10 fold for specific sequences (such as those with single nucleotide repeats).

In general, the sequence of the deleted region is obtained through one or more overlapped target fragments derived from copies of the sample polynucleotide. The deletion of an exact or substantially exact number of bases in the creation of new mate pairs distinguishes this invention from use of traditional mate pairs, where the distance between a pair of sequences is highly approximate (e.g., +/-5% to 20% of the distance of usually 0.3 kb to 3 kb or longer).

Precise deletion mate pairs are particularly useful in sequencing extensive regions of repeating sequences (i.e., tandem repeats and especially triple repeats that frequently cause diseases when over-expanded), in identifying multiple “local” mutations, and in identifying long insertions and deletions. Furthermore, deletion of about 10-100 bases in staggered fragments may help in removing or reducing secondary structures or sequence regions of extreme GC content, thus allowing higher-quality data or the ability to sequence certain gene or genome regions.
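One way to picture the use of deletion mate pairs on tandem repeats is the toy simulation below. It is only a sketch under stated assumptions: the flanking sequences, the CAG repeat, and the helper name are invented for the example, and the right arm is generated from a known target rather than from an actual sequencing read. The left arm is taken to end where the repeat begins, and the deletion length whose right arm lands on the unique right flank gives the length of the repeat tract.

```python
def repeat_length_from_deletions(target, repeat_start, right_flank,
                                 deletion_lengths, read_len=5):
    """Return the deletion length whose right arm matches the right flank.

    `repeat_start` is the index where the left arm ends and the repeat
    begins. Returns None when no tested deletion length bridges the repeat.
    """
    expected = right_flank[:read_len]
    for d in sorted(deletion_lengths):
        arm_start = repeat_start + d
        right_arm = target[arm_start:arm_start + read_len]
        if right_arm == expected:
            return d
    return None


# Hypothetical locus: unique flanks around ten copies of a triplet repeat.
left_flank = "GATTACAGGC"
right_flank = "TTGACCTGAA"
target = left_flank + "CAG" * 10 + right_flank

tract_length = repeat_length_from_deletions(
    target, len(left_flank), right_flank, deletion_lengths=[10, 20, 30, 40]
)
```

Here only the construct with the 30 nt deletion pairs the left flank with the right flank, so the tract is inferred to be 30 bases long although no read crosses it. A tract whose length is not among the tested deletion lengths returns None in this sketch, which is why a graded series of deletions is needed.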

In one aspect of the invention, use of deletion mate pairs provides selective determination of sequences of two polynucleotide segments with predefined distance, i.e. to skip direct identification of a defined (exact or almost exact) number of bases that are present (i.e. not deleted) in the analyzed target.

In another aspect, the invention also provides methods for using deletion mate pair constructs to generate amplicons and libraries. In addition, the invention provides methods of creating random arrays that include deletion mate pair constructs and amplicons of deletion mate pair constructs. Such arrays can, in accordance with the invention, be used to analyze the nucleotide sequences of deletion mate pairs. Sequence reads resulting from such analysis can be assembled more efficiently than is possible in traditional sequencing methods, because the sequence reads can be aligned based on not only overlapping sequences, but also based on the known lengths of the deleted regions separating each deletion mate pair. The methods of the invention are particularly useful in sequencing extensive regions of repeating sequences (i.e., tandem repeats), in identifying multiple “local” mutations, and in identifying long insertions and deletions.
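The alignment benefit of a known deletion length can be sketched as follows. This is a simplified illustration using exact string matching against a reference; the function names and sequences are assumptions for the example and do not represent an assembly procedure disclosed here. A left arm that occurs more than once in the reference is placed unambiguously once the right arm is required to sit at exactly the known spacing.

```python
def find_all(reference: str, query: str):
    """Return every 0-based start position of `query`, overlaps included."""
    hits = []
    i = reference.find(query)
    while i != -1:
        hits.append(i)
        i = reference.find(query, i + 1)
    return hits


def place_mate_pair(reference: str, left: str, right: str, deletion_len: int):
    """Positions of `left` where `right` begins exactly `deletion_len` bases after it."""
    gap = len(left) + deletion_len
    return [p for p in find_all(reference, left)
            if reference.startswith(right, p + gap)]


# Hypothetical reference in which the left arm "ACGTA" occurs twice.
reference = ("ACGTA" + "T" * 10 + "GGCAT" + "CC"
             + "ACGTA" + "T" * 10 + "CCTAG")

left_hits = find_all(reference, "ACGTA")
placements = place_mate_pair(reference, "ACGTA", "CCTAG", deletion_len=10)
```

The left arm alone matches two positions, while the pair with its 10-base gap matches only one of them.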

Compositions/Structures of Target Polynucleotides

The present invention provides compositions and methods that are derived from and/or utilize target polynucleotides from samples. As will be appreciated by those in the art, the sample solution may comprise any number of things, including, but not limited to, bodily fluids (including, but not limited to, blood, urine, serum, lymph, saliva, anal and vaginal secretions, perspiration and semen) and cells of virtually any organism, with mammalian samples being preferred and human samples being particularly preferred; environmental samples (including, but not limited to, air, agricultural, water and soil samples); biological warfare agent samples; research samples (i.e., in the case of nucleic acids, the sample may be the products of an amplification reaction, including both target and signal amplification, such as PCR amplification reactions); purified samples, such as purified genomic DNA and RNA preparations; and raw samples (bacteria, virus, genomic DNA, etc.). In accordance with the present invention, samples may be subjected to virtually any experimental manipulation.

In general, cells from a target organism (animal, avian, mammalian, etc.) are used. When genomic DNA is used, the amount of genomic DNA required for constructing arrays and substrates of the invention can vary widely. In one embodiment, genomic DNA is obtained using conventional techniques, for example, as disclosed in Sambrook et al., supra, 1999; Current Protocols in Molecular Biology, Ausubel et al., eds., (John Wiley and Sons, Inc., NY, 1999), or the like. In a preferred embodiment, isolated genomic DNA is free of DNA processing enzymes and contaminating salts, represents the entire genome equally, and comprises DNA fragments from about 1,000 to about 100,000 base pairs in length. In a particularly preferred embodiment, human genomic DNA is used in methods and compositions of the invention.

In one aspect, for mammalian-sized genomes, fragments are generated from at least about 1 genome-equivalent of DNA; and in another aspect, fragments are generated from at least about 10 genome-equivalents of DNA; and in another aspect, fragments are generated from at least about 30 genome-equivalents of DNA.

Target polynucleotides of the invention are nucleic acids. By “nucleic acid” or “oligonucleotide” or grammatical equivalents herein is meant at least two nucleotides covalently linked together. A nucleic acid of the present invention will generally contain phosphodiester bonds, although in some cases, as outlined below (for example in the construction of primers and probes such as label probes), nucleic acid analogs are included that may have alternate backbones, comprising, for example, phosphoramide (Beaucage et al., Tetrahedron 49(10):1925 (1993) and references therein; Letsinger, J. Org. Chem. 35:3800 (1970); Sprinzl et al., Eur. J. Biochem. 81:579 (1977); Letsinger et al., Nucl. Acids Res. 14:3487 (1986); Sawai et al, Chem. Lett. 805 (1984), Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); and Pauwels et al., Chemica Scripta 26:141 (1986)), phosphorothioate (Mag et al., Nucleic Acids Res. 19:1437 (1991); and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu et al., J. Am. Chem. Soc. 111:2321 (1989)), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press), and peptide nucleic acid backbones and linkages (see Egholm, J. Am. Chem. Soc. 114:1895 (1992); Meier et al., Chem. Int. Ed. Engl. 31:1008 (1992); Nielsen, Nature, 365:566 (1993); Carlsson et al., Nature 380:207 (1996), all of which are incorporated by reference). Other analog nucleic acids include those with bicyclic structures including locked nucleic acids, Koshkin et al., J. Am. Chem. Soc. 120:13252-3 (1998); positive backbones (Denpcy et al., Proc. Natl. Acad. Sci. USA 92:6097 (1995)); non-ionic backbones (U.S. Pat. Nos. 5,386,023, 5,637,684, 5,602,240, 5,216,141 and 4,469,863; Kiedrowshi et al., Angew. Chem. Intl. Ed. English 30:423 (1991); Letsinger et al., J. Am. Chem. Soc. 110:4470 (1988); Letsinger et al., Nucleoside & Nucleotide 13:1597 (1994); Chapters 2 and 3, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook; Mesmaeker et al., Bioorganic & Medicinal Chem. Lett. 4:395 (1994); Jeffs et al., J. Biomolecular NMR 34:17 (1994); Tetrahedron Lett. 37:743 (1996)) and non-ribose backbones, including those described in U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook. Nucleic acids containing one or more carbocyclic sugars are also included within the definition of nucleic acids (see Jenkins et al., Chem. Soc. Rev. (1995) pp. 169-176). Several nucleic acid analogs are described in Rawls, C & E News Jun. 2, 1997 page 35. All of these references are hereby expressly incorporated by reference. Modifications of the ribose-phosphate backbone may be made to increase the stability and half-life of such molecules in physiological environments. For example, PNA:DNA hybrids can exhibit higher stability and thus may be used in some embodiments.

Target polynucleotides may be generated from a source nucleic acid, such as genomic DNA, cDNA (including cDNA libraries), cRNA (including cRNA libraries), siRNA (and siRNA libraries) and mRNA (as well as products of transcription and reverse transcription). In a preferred embodiment, target polynucleotides are generated from source nucleic acid by fragmentation to produce fragments of one or more specific sizes. This fragmentation may be accomplished by methods known in the art, including chemical, enzymatic and mechanical fragmentation. In one embodiment, the fragments are from about 50 to about 2000 nucleotides in length. In another embodiment, the fragments are from 50 to 600 nucleotides in length. In another embodiment, the fragments are 300 to 600 or 200 to 2000 nucleotides in length. In yet another embodiment, the fragments are 10-100, 50-100, 50-300, 100-200, 200-300, 50-400, 100-400, 200-400, 400-500, 400-600, 500-600, 50-1000, 100-1000, 200-1000, 300-1000, 400-1000, 500-1000, 600-1000, 700-1000, 700-900, 700-800, 800-1000, 900-1000, 1500-2000, or 1750-2000 nucleotides in length. These fragments may in turn be circularized for use in an RCR reaction or in other biochemical processes, such as the insertion of additional adaptors.

For the methods of the present invention, it is preferable to utilize between 10-200 or more copies of substantially identical polynucleotide fragments in the creation of the constructs of the invention to ensure optimal coverage of the entire polynucleotide. For one aspect of the invention, the polynucleotide fragments may be obtained with a mixture of enzymes, e.g., a mixture of 2-20 restriction endonucleases, to provide multiple substantially identical copies of fragments from a sample comprising multiple copies of a target polynucleotide, e.g., the human genome. The restriction endonucleases for use in the fragmentation of the polynucleotides are preferably frequent 4-base cutters or special 2-base cutters with a combined frequency of one recognition site in every 10 to 300 bases in the target polynucleotide.
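The stated combined frequency of one recognition site in every 10 to 300 bases can be checked with a back-of-the-envelope estimate. The sketch below assumes a random sequence with equal base frequencies, under which an n-base site is expected once every 4^n bases, so a single 4-base cutter is expected about once every 256 bases; real genomes deviate from this assumption. The four enzymes listed are common 4-base cutters with palindromic recognition sequences, chosen here only as an illustrative panel and not taken from this specification.

```python
import re

# Illustrative panel of 4-base cutters (recognition sequences only;
# the exact cut position within each site is not modeled).
FOUR_BASE_SITES = {
    "AluI": "AGCT",
    "HaeIII": "GGCC",
    "MboI": "GATC",
    "MspI": "CCGG",
}


def expected_spacing(site_lengths):
    """Expected bases between sites for a pool of enzymes, assuming
    a random sequence with equal base frequencies."""
    sites_per_base = sum(4.0 ** -n for n in site_lengths)
    return 1.0 / sites_per_base


def site_positions(seq: str, sites: dict):
    """Sorted 0-based start positions of every recognition site in `seq`.

    The sites above are palindromic, so scanning one strand is sufficient.
    """
    positions = set()
    for site in sites.values():
        for match in re.finditer(f"(?={site})", seq):
            positions.add(match.start())
    return sorted(positions)


single = expected_spacing([4])             # one 4-base cutter
pooled = expected_spacing([4, 4, 4, 4])    # four 4-base cutters together
found = site_positions("AAGATCTTGGCCAAAGCT", FOUR_BASE_SITES)
```

Under these assumptions, a pool of 2 to 20 four-base cutters gives an expected spacing of roughly 128 down to about 13 bases, which falls within the 10 to 300 base range given above.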

In a specific embodiment, the preferred shifts (distances between starts of neighboring fragments) are between 10-300 bases, and the preferred fragment length for creation of the construct is about 500-10,000 nucleotides.

Multiple independent complete or partial DNA digestions can be used to obtain the desired number of copies of the polynucleotide fragments for construct construction. In specific aspects, the optimized approach is to perform separate reactions for each enzyme or for several small pools of enzymes. For example, between 4 and 8 separate reactions, each with a mixture of 2 to 4 restriction enzymes, may be used in the preparation of the polynucleotides for construct creation.

In addition to natural or engineered restriction enzymes, other sequence-specific cutting reagents may be used, alone or in combination with restriction endonucleases, in creating the polynucleotide fragments to be used in the constructs. One example of such a reagent is a base-specific chemical, such as dimethylsulfate, HCO₂H, hydrazine, or piperidine. In another example, oligonucleotide-defined cleavage sites can be used to fragment the polynucleotides instead of using restriction endonucleases. From 1-100, preferably 2-75, or even more preferably 5-35 million cutting anchors may be used in 2-10 pools. Such cutting anchors may optionally have a capture group for isolation of the fragments, and in certain aspects fragments of a specific length range may be isolated after cleavage.

In certain aspects of the invention, both restriction endonucleases and oligonucleotides can be used in fragmentation of the polynucleotide for creation of the constructs of the invention. A smaller, selected set of oligonucleotides may be designed to complement restriction enzyme cutting and provide additional fragments in the low coverage areas or areas of specific interest, e.g., areas with specific disease-associated loci.

The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acids may be DNA, both genomic and cDNA, RNA or a hybrid, where the nucleic acid contains any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, etc.

“Target polynucleotides” and “target nucleic acids” comprise “target sequences”. As used herein, “target sequence” refers generally to a nucleic acid sequence on a single strand of nucleic acid. The target sequence may be a portion of a gene, a regulatory sequence, genomic DNA, cDNA, RNA including mRNA and rRNA, or others. As is outlined herein, the target sequence may be a target sequence from a sample, or a secondary target such as a product of an amplification reaction, a fragmentation reaction, and the like. A target sequence may be of any length. A target sequence often comprises a fragment of a target polynucleotide, and the length of that fragment may comprise some or all of the target polynucleotide from which it is derived. For a target sequence or a polynucleotide fragment to be “derived” from a target polynucleotide (or any polynucleotide) can mean that the target sequence/polynucleotide fragment is formed by physically, chemically, and/or enzymatically fragmenting a target polynucleotide (or any other polynucleotide). To be “derived” from a polynucleotide may also mean that the fragment is the result of a replication or amplification of a particular subset of the nucleotide sequence of the target polynucleotide.

The target sequence may also include a number of target domains, and these target domains may include the same or different sequences. For example, a first target domain of the sample target sequence may hybridize to a capture probe and a second target domain may hybridize to a label probe, etc. The target domains may be adjacent or separated as indicated. Unless specified, the terms “first” and “second” are not meant to confer an orientation of the sequences with respect to the 5′-3′ orientation of the target sequence. For example, assuming a 5′-3′ orientation of the complementary target sequence, the first target domain may be located either 5′ to the second domain, or 3′ to the second domain.

Adaptors

The invention preferably includes adaptors at spaced locations within a target polynucleotide or a fragment of a polynucleotide. As used herein, “adaptors” are nucleic acids of known sequence. Generally, adaptors are significantly shorter in length than the target polynucleotides into which they are inserted.

In accordance with the invention, adaptors may serve as platforms for interrogating adjacent sequences using various sequencing chemistries, such as those that identify nucleotides by primer extension, probe ligation, and the like. A unique component of embodiments of the invention is the insertion of known adaptor sequences into target polynucleotides, such that there is an interruption of contiguous target sequences with the adaptors. By sequencing both “upstream” and “downstream” of the adaptor, sequence information for entire target sequences may be obtained. Adaptors can also be used in accordance with the invention to circularize polynucleotides.

Adaptors can be added to the ends of polynucleotide molecules—such adaptors are also referred to herein as “end adaptors”. Adaptors can also be “interspersed adaptors”, meaning that these adaptors are inserted into the “interior” of a polynucleotide molecule—i.e., interspersed adaptors separate two regions of a polynucleotide molecule, as described in U.S. application Ser. No. 11/679,124, which is hereby incorporated by reference. The adaptor may separate regions that are contiguous in the original polynucleotide or in the original genomic sequence from which the polynucleotide is derived. In another aspect, the adaptor may separate target sequence regions with known approximate or exact distance, including distance information for variations including bases deleted, repeated, etc.

In accordance with the invention, adaptors can include multiple features. Such features can include without limitation restriction endonuclease recognition sites, anchor probe hybridization sites (for use in analysis), sequencing probe hybridization sites, capture probe hybridization sites, and polymerase recognition sequences. Polynucleotide molecules that include adaptors with capture probe hybridization sites can be immobilized on a surface that contains capture probes through hybridization of the capture probes with the adaptors containing complementary capture probe hybridization sites.

In a preferred embodiment, adaptors include recognition sites for Type IIs restriction endonucleases. Exemplary Type IIs restriction endonucleases include, but are not limited to, Eco57M I, Mme I, Acu I, Bpm I, BceA I, Bbv I, BciV I, BpuE I, BseM II, BseR I, Bsg I, BsmF I, BtgZ I, Eci I, EcoP15 I, Fok I, Hga I, Hph I, Mbo II, Mnl I, SfaN I, TspDT I, TspDW I, Taq II, and the like.

In some embodiments, each adaptor comprises the same Type IIs restriction endonuclease site. In alternative embodiments, different adaptors comprise different sites. In specific embodiments, one or more of the adaptors comprises two or more Type IIs restriction endonuclease sites, for use in bi-directional cutting or to provide additional specificity when introducing multiple adaptors.

In one embodiment of the invention, an adaptor can comprise a primer binding sequence. This primer binding sequence may be used, for example, to bind a primer for a polymerase. As is known in the art, in order to replicate a template, polymerases generally require a single stranded template (concatemers of the invention, for example), wherein the single stranded template includes a portion of double stranded nucleic acid. Essentially, any sequence can serve as a primer binding sequence to bind a primer, because any double stranded sequence will be recognized by the polymerase. In general, the primer binding sequence is from about 3 to about 60 nucleotides in length, with from about 15 to about 25 being preferred. Primer oligonucleotides are usually 6 to 25 bases in length. As will be appreciated by those in the art, the primer binding sequence can be contained within any other part of adaptor sequences. The primer binding sequence will hybridize to a complementary sequence on a primer, thus forming the requisite double stranded region for a polymerase to recognize and then replicate the remainder of the single stranded template.

In accordance with the invention, an adaptor can also comprise a capture probe recognition sequence. As is more fully outlined below, one embodiment of the invention utilizes capture probes on the surface of a substrate to immobilize polynucleotide molecules. The term “polynucleotide molecules” includes polynucleotides, target polynucleotides, target sequences and can also include other components such as adaptors. In one embodiment, the polynucleotide molecules include adaptors which comprise a domain sufficiently complementary to one or more capture probes to allow hybridization of the domain and the capture probe, resulting in immobilization of the polynucleotide molecule on the surface.

In one aspect, an adaptor comprises a secondary structure sequence. In a preferred aspect, adaptors include palindromic sequences or sequences complementary between adaptors, which foster intramolecular interactions within the target polynucleotide. For example, palindromic or complementary sequences in a plurality of adaptors within the concatemer can result in hybridization between adaptors (e.g., intramolecular interactions between copies in the concatemer) or within the adaptor itself, e.g., resulting in hairpins. These structures can serve to “tighten” the three dimensional structure of the polynucleotide. In the case of concatemers formed from polynucleotides comprising adaptors, which are described in further detail below, palindromic or complementary sequences within the adaptors can provide a secondary structure that results in a more compact spheroid shape. These palindromic and/or complementary sequence units can be 5, 6, 7, 8, 9, 10 or more nucleotides in length and can be designed using a variety of different sequences. In one embodiment, palindromic sequences can be chosen to provide a specific melting temperature. In one exemplary embodiment, a palindrome AAAAAAATTTTTTT (SEQ ID NO: 1) will form a 14 base dsDNA hybrid with a neighboring unit that includes the complementary palindrome TTTTTTTAAAAAAA (SEQ ID NO: 2), resulting in a “local” region of double stranded DNA within the secondary structure of a single stranded polynucleotide molecule, such as a concatemer.
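By way of illustration only, the following minimal Python sketch checks whether a candidate adaptor unit is palindromic in the nucleic acid sense, i.e., identical to its own reverse complement, which is the property that allows copies of the same unit to base-pair with one another or to fold into a hairpin. The function names are hypothetical and are not part of the specification; only the four standard DNA bases are assumed.

```python
# Illustrative sketch only; names are hypothetical, not taken from the specification.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindromic(seq):
    """True if the sequence reads the same as its reverse complement."""
    return reverse_complement(seq) == seq.upper()


if __name__ == "__main__":
    for unit in ("AAAAAAATTTTTTT", "TTTTTTTAAAAAAA", "AAAAC"):
        print(unit, is_palindromic(unit))
```

Such a check could be applied when screening candidate palindromic units of 5, 6, 7, 8, 9, 10 or more nucleotides before they are placed in an adaptor.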

In one embodiment, an adaptor can comprise one or more binding sequences for a detectable tag, such as a label probe. In some embodiments, label probes can be added to the concatemers to detect particular sequences. Label probes will hybridize to the label probe binding sequence and comprise at least one detectable label. Such labels include without limitation the direct or indirect attachment of radioactive moieties, fluorescent moieties, colorimetric moieties, chemiluminescent moieties, and the like. Many comprehensive reviews of methodologies for labeling DNA and constructing DNA adaptors provide guidance applicable to constructing oligonucleotide probes of the present invention. Such reviews include Kricka, Ann. Clin. Biochem., 39: 114-129 (2002); Schaferling et al, Anal. Bioanal. Chem., (Apr. 12, 2006); Matthews et al, Anal. Biochem., Vol 169, pgs. 1-25 (1988); Haugland, Handbook of Fluorescent Probes and Research Chemicals, Tenth Edition (Invitrogen/Molecular Probes, Inc., Eugene, 2006); Keller and Manak, DNA Probes, 2nd Edition (Stockton Press, New York, 1993); Eckstein, editor, Oligonucleotides and Analogues: A Practical Approach (IRL Press, Oxford, 1991); Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Hermanson, Bioconjugate Techniques (Academic Press, New York, 1996); and the like. Many more particular methodologies applicable to the invention are disclosed in the following sample of references: Fung et al, U.S. Pat. No. 4,757,141; Hobbs, Jr., et al U.S. Pat. No. 5,151,507; Cruickshank, U.S. Pat. No. 5,091,519 (synthesis of functionalized oligonucleotides for attachment of reporter groups); Jablonski et al, Nucleic Acids Research, 14: 6115-6128 (1986) (enzyme-oligonucleotide conjugates); Ju et al, Nature Medicine, 2: 246-249 (1996); Bawendi et al, U.S. Pat. No. 6,326,144 (derivatized fluorescent nanocrystals); Bruchez et al, U.S. Pat. No. 6,274,323 (derivatized fluorescent nanocrystals); and the like.

In one embodiment, an adaptor can comprise one or more tagging sequences. In this embodiment, tagging sequences may be used to isolate and/or purify circularized target polynucleotides and concatemers from a mixture. In some embodiments, tagging sequences may include unique nucleic acid sequences that can be utilized to identify the origin of target sequences in mixtures of tagged samples, or can include components of ligand binding pairs, such as biotin/streptavidin, etc. Tagging sequences may also comprise a binding site for a detectable label, such as a fluorescently labeled probe.

In one aspect, multiple adaptors are included within a target polynucleotide or any other polynucleotide molecule. In one aspect, interspersed adaptors each have a length in the range of from about 4 to about 4000 nucleotides. In one embodiment, the interspersed adaptors have a length of from about 8 to about 60 nucleotides; in another embodiment, they have a length in the range of from 8 to 32 nucleotides; in another aspect, they have a length in a range selected from about 4 to about 400 nucleotides, from about 10 to about 100 nucleotides, from about 400 to about 4000 nucleotides, from about 10 to about 80 nucleotides, from about 20 to about 70 nucleotides, from about 30 to about 60 nucleotides, and from about 4 to about 10 nucleotides. In a particularly preferred embodiment, interspersed adaptors with a length of from about 20 to about 30 bases are used in accordance with the invention.

The number of interspersed adaptors inserted into target polynucleotides may vary widely and depends on a number of factors, including the sequencing/genotyping chemistry being used (and its read-length capacity), the particular length of the cleavage site of a particular Type IIs site, the number of nucleotides desired to be identified within each target polynucleotide, whether amplification steps are employed between insertions, and the like.

In one aspect, a plurality of interspersed adaptors is inserted at separate sites of a target polynucleotide; this may include two, three, four or more interspersed adaptors that are inserted within the target polynucleotide. Alternatively, the number of interspersed adaptors inserted into a target polynucleotide ranges from 2 to 10; from 2 to 4; from 3 to 6; from 3 to 4; and from 4 to 6. In another aspect, interspersed adaptors may be inserted in one or both polynucleotide segments of a longer polynucleotide, e.g., 0.4-4 Kb in length, that have been ligated together directly or indirectly in a circularization operation (referred to herein as a “mate-pair”). In one aspect, such polynucleotide segments may be 4-400 (preferably 10-100) bases long.

It should also be noted that in general, the first adaptor attached to a target sequence is not “interspersed” or “inserted”. That is, the first adaptor is generally attached to one terminus of the fragmented target sequence, and the subsequent adaptors are interspersed within a contiguous target sequence.

Interspersed adaptors may in accordance with the invention be single or double stranded.

In some embodiments, adaptors can be used to create “classes” of polynucleotides. By “classes” is meant groups of polynucleotides that share a common feature—for example, such features can include source/sample of origin, length, amount of processing (including circularization, deletion, further fragmentation), as well as any other feature by which a particular group of polynucleotides can be differentiated from another group of polynucleotides. In one aspect, each member of a group of target polynucleotides has an adaptor with an identical anchor probe binding site and type IIs recognition site attached to a DNA fragment from source nucleic acid. In another embodiment, classes of polynucleotides may be created by providing adaptors having different anchor probe binding sites. Such classes may be created by providing adaptors having distinct sequences or features to differentiate among polynucleotides from different classes. For example, adaptors can comprise different anchor probe binding sites. This type of “clustering” can increase the efficiency of identifying and analyzing sequence information of the target polynucleotides.

In one embodiment, if a polynucleotide is “associated with” an adaptor, this can mean that the target polynucleotide is identified as being part of a “class” as discussed above. To be associated with an adaptor also generally refers to aspects of the invention in which an adaptor can be used to identify or tag a polynucleotide.

Interspersed adaptors are nucleic acid sequences that are inserted at spaced locations within the interior region of a target polynucleotide. In one aspect, “interior” in reference to a target polynucleotide means a site internal to a target polynucleotide prior to processing, such as circularization and cleavage, that may introduce sequence inversions, or like transformations, which disrupt the ordering of nucleotides within a target polynucleotide. In one very specific aspect, interspersed adaptors are inserted at intervals within a contiguous region of a target polynucleotide. In some cases, such intervals have predetermined lengths, which may or may not be equal. In other cases, the spacing between interspersed adaptors may be known only to an accuracy of from one to a few nucleotides (e.g., from 1 to 15), or from one to a few tens of nucleotides (e.g., from 10 to 40), or from one to a few hundreds of nucleotides (e.g., from 100 to 200). In some cases about 1 to 4 bases of target polynucleotide may be deleted or duplicated in the process of adaptor insertion. Preferably, the ordering and number of interspersed adaptors within each target polynucleotide is known. In some aspects of the invention, interspersed adaptors are used together with adaptors that are attached to the ends of target polynucleotides.

Circularizing Polynucleotide Molecules

In a preferred aspect, polynucleotides and portions of polynucleotides are ligated to adaptors and then circularized as preparation for use in other aspects of the invention described herein. Although many of the embodiments described herein refer to “polynucleotides” and “polynucleotide molecules”, these descriptions also apply to all other polynucleotide molecules described herein, including “target polynucleotides”, “target sequences”, “concatemers”, “target nucleic acids”, “nucleic acids”, “DNA nanoballs” and the like.

In one aspect, circularization of polynucleotide molecules can generally be described as follows (it should be noted that genomic DNA is used as an example herein, but is not meant to be limiting). Genomic DNA from any organism is isolated and fragmented into target polynucleotides using standard techniques. A first adaptor is ligated to one terminus of the target polynucleotide. The adaptor preferably comprises a recognition site for a Type IIs restriction endonuclease, which cuts outside of its recognition sequence. If the enzyme results in a “sticky” end, the overhang portion can either be filled in or removed.

In one embodiment, an enzyme is used to ligate the two ends of the linear strand comprising the adaptor and the target polynucleotide to form a circularized nucleic acid. This may be done using a single step. Alternatively, a second adaptor can be added to the other terminus of the target polynucleotide (for example, a polyA tail), and then a bridging sequence can be hybridized to the two adaptors, followed by ligation. In either embodiment, a circular sequence is formed.

The circular sequence is then cut with the Type IIs endonuclease, resulting in a linear strand, and the process is repeated. This results in a circular polynucleotide with adaptors interspersed at well defined locations within previously contiguous target sequences.

If double stranded DNA is used, then the ends of the fragments may be prepared for circularization by “polishing” and optional ligation of adaptors using conventional techniques, such as employed in conventional shotgun sequencing, e.g., Bankier, Methods Mol. Biol., 167: 89-100 (2001); Roe, Methods Mol. Biol., 255: 171-185 (2004), which are hereby incorporated by reference.

In a preferred embodiment, target polynucleotide fragments of about 0.2 to about 2 kb in size are used in a circularization reaction. In a more preferred embodiment, the fragments are from about 0.3 to about 0.6 kb in size.

In most aspects of the invention, the polynucleotide for circularization is preferably greater than 150 bases in length, more preferably greater than 400 bases in length. In specific embodiments, polynucleotide fragments of 100-1000 bases in length, more preferably 300-3000 bases in length, up to and including 30,000 bases or more in length may be used in the circularization methods of the invention. Adaptor length can be varied depending on the approximate or exact length of the polynucleotide insert(s) used to form the circles. For example, when a longer polynucleotide insert is used, adaptor length is preferably between 10 and 100 bases, more preferably between about 20 and 30 bases. When a shorter polynucleotide insert is used, a longer adaptor can be used to facilitate circle formation. For example, when the insert comprises two shorter sequences of defined length, e.g., a combined length of 24 nucleotides, the adaptor is preferably 100-500 nucleotides in length, more preferably 150 to 300 nucleotides in length.

In one embodiment, “adaptor segments” are used to circularize polynucleotides. In this embodiment, one portion of an adaptor is ligated to one end of a polynucleotide molecule and the remaining portion is ligated to the other end. The polynucleotide molecule is then circularized by ligating the two portions of the adaptor (the “adaptor segments”) to form a whole adaptor.

In one aspect, the invention utilizes a method of circularization as illustrated in FIG. 1. After genomic DNA (100) is fragmented and denatured (102), single stranded DNA fragments (104) are first treated with a terminal transferase (106) to attach poly dA tails (108) to the 3′ ends. This is then followed by ligation (112) of the free ends intra-molecularly with the aid of a bridging oligonucleotide (110) that is complementary to the poly dA tail at one end and complementary to any sequence at the other end by virtue of a segment of degenerate nucleotides. A duplex region (114) of the bridging oligonucleotide (110) contains at least a primer binding site for RCR and, in some embodiments, comprises sequences that provide complements to a capture probe, which may be the same or different from the primer binding site sequence, or which may overlap with the primer binding site sequence. The length of the capture probe may vary widely. In one aspect, capture probes and their complements in a bridging oligonucleotide have lengths in the range of from 10 to 100 nucleotides; and more preferably, in the range of from 10 to 40 nucleotides. Circular products (116) may be conveniently isolated by a conventional purification column, digestion of non-circular DNA by one or more appropriate exonucleases, or both.

In some aspects, the duplex region (114) may contain additional elements, such as an oligonucleotide tag, for example, for identifying the source nucleic acid from which its associated DNA fragment came. That is, in specific methods, circles or adaptor ligation or concatemers from different source nucleic acids may be prepared separately during which a bridging adaptor containing a unique tag is used, after which they are mixed for concatemer preparation or application to a surface to produce a random array. The associated fragments may be identified on such a random array by hybridizing a labeled tag complement to its corresponding tag sequences in the concatemers, or by sequencing the entire adaptor or the tag region of the adaptor.

In certain aspects of the embodiments, DNA circles prepared from source nucleic acid need not include an adaptor oligonucleotide. These circularized products can be used directly in the preparation of concatemers, as described in more detail herein.

Polynucleotide fragments can also be circularized using circularizing enzymes, such as CircLigase, a single stranded DNA ligase that circularizes single stranded DNA without the need of a template. CircLigase is used in accordance with the manufacturer's instructions (Epicentre, Madison, Wis.). In a preferred embodiment, single stranded polynucleotide circles comprising a DNA fragment and one or more adaptors are formed by using a standard ligase (such as T4 ligase) to ligate an adaptor to one end of a DNA fragment. CircLigase is then used to close the circle.

Deletion Mate Pairs

In a preferred aspect, polynucleotide molecules of the invention comprise “deletion mate pairs”. As used herein, the term “deletion mate pair” refers to two target sequences that are adjacent in a construct (or adjoined in a construct by an introduced element such as an adaptor) but are separated by a known or expected distance within the genome or polynucleotide molecule from which they are derived. For example, as illustrated in FIG. 2, if a target polynucleotide (201) comprises contiguous domains X(202), Y(203) and Z(204), where the length of Y is known, then deletion of domain Y(203) results in two target sequences X (205) and Z (206) which are separated by a known distance Y. Target sequences X (205) and Z (206) in this case would be a deletion mate pair. Such deletion mate pairs may be contained within a linear or a circular polynucleotide molecule. It should be noted that FIG. 2 is only meant to illustrate conceptually what a deletion mate pair is, and does not necessarily reflect how a deletion mate pair is formed. Methods for forming deletion mate pairs are described in more detail for various aspects of the invention described below.

The deletion mate pair technique is particularly useful in determining the lengths and/or nucleotide sequence of repeating sequences within a target polynucleotide, a genome, a nucleotide library, and the like. Many sequencing techniques have relatively short read lengths, on the order of 1-20, 2-15, 4-10, or 6-8 bases. Since these shorter read lengths may not be able to sequence through long stretches of repeating sequence, such as repeating sequences that extend for 20, 30, 40, 50 or more bases, assembling a complete sequence from short read lengths can be difficult for several reasons, including without limitation because the endpoints of the repeating sequences cannot be determined. By utilizing two or more deletion mate pair constructs with overlapping fragments having deletions of known length, even a short read length of about 1-20, 2-15, 4-10, or 6-8 bases can be used to identify the length and/or nucleotide sequence of a target sequence. For example, as shown in FIG. 3, a target polynucleotide (301) is illustrated comprising a stretch of 16 repeating Ts flanked by non-repeating sequences (“B” represents any one of the four possible bases, A, C, G, T). Fragments (302)-(304) each comprise a deletion of a specific length (12 nt), represented by the underlined Xs. If two 6-base reads are obtained flanking the 12-base deletion in each of the three overlapping fragments, fragment (304) shows that the T repeats do not extend past its deletion area. Fragments (302) and (303) can be used to detect the 16 Ts: each fragment will identify the end of the T repeats and, using the known deletion, a 12-base read of each fragment can identify a 16 base T repeat (4 detected Ts+12 deleted Ts in (302) and 5 detected Ts+11 deleted Ts in (303)), which would not be possible without the use of such deletion fragments and mate pair construction. Thus, the identified sequences can be aligned using not only the bases that are identified in each sequence read, but also by the number of bases that are known to be deleted between two target domains.
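The arithmetic of the FIG. 3 example can be reproduced with a short forward simulation, offered here only as a sketch. The flanking sequences, the deletion start positions and the function names below are hypothetical choices made for illustration; they are not the sequences of FIG. 3. The simulation has access to the deleted bases, whereas in practice the deleted bases are not read and their number is inferred from the known deletion length and the overlapping fragments.

```python
# Illustrative forward simulation; flanks, offsets and names are assumptions.
def trailing_run(seq, base):
    """Number of consecutive copies of base at the 3' end of seq."""
    return len(seq) - len(seq.rstrip(base))


def leading_run(seq, base):
    """Number of consecutive copies of base at the 5' end of seq."""
    return len(seq) - len(seq.lstrip(base))


def repeat_accounting(target, del_start, del_len=12, read_len=6, base="T"):
    """Return (detected, deleted) counts of the repeat base for one fragment.

    detected: repeat bases seen in the two reads flanking the deletion.
    deleted:  repeat bases falling inside the deleted stretch.
    """
    left_read = target[del_start - read_len:del_start]
    deleted = target[del_start:del_start + del_len]
    right_read = target[del_start + del_len:del_start + del_len + read_len]
    detected = trailing_run(left_read, base) + leading_run(right_read, base)
    return detected, deleted.count(base)


# Hypothetical target: 16 Ts flanked by non-repeating, T-free sequence.
TARGET = "GACGACGA" + "T" * 16 + "CAGCAGCA"

if __name__ == "__main__":
    for start in (12, 7):  # deletion offsets chosen to mimic (302) and (303)
        detected, deleted = repeat_accounting(TARGET, start)
        print(start, detected, deleted, detected + deleted)
```

With these assumed offsets, the first fragment yields 4 detected and 12 deleted Ts and the second yields 5 detected and 11 deleted Ts, so both account for the same 16-base repeat.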

The schematic in FIG. 4A illustrates one method of producing a deletion mate pair construct. By “deletion mate pair construct” is meant a polynucleotide molecule that comprises one or more deletion mate pairs. A deletion mate pair construct may be a linear or a circular molecule. In the method illustrated in FIG. 4A, a circular polynucleotide molecule (401) is formed using methods described herein, and the circular polynucleotide molecule (401) comprises Adaptor A1 (402). Adaptor A1 preferably includes a recognition site for a restriction endonuclease (depicted as a black oval). In the embodiment illustrated in FIG. 4A, the recognition site in Adaptor A1 (402) is for an “exact cutter”. By “exact cutter” is meant that the restriction endonuclease cuts at a known distance from the recognition site in all or most of the polynucleotide molecules. Some “wobbling” exists, so that even with an exact cutter there can be a very small (e.g., 1-10%) amount of cutting that happens one or two bases from the expected cutting site, but this number is small enough so as to not unduly change the fundamental methods of the invention. Exact cutter endonucleases include without limitation Type IIs restriction endonucleases such as Eco57M I, Mme I, Acu I, Bpm I, BceA I, Bbv I, BciV I, BpuE I, BseM II, BseR I, Bsg I, BsmF I, BtgZ I, Eci I, EcoP15 I, Fok I, Hga I, Hph I, Mbo II, Mnl I, SfaN I, TspDT I, TspDW I, Taq II, and the like.

In a preferred embodiment, the exact cutter used in forming a deletion mate pair construct is known to cut 6-30 bases from its recognition site; in further embodiments, the endonuclease cuts 8-16 bases, or 10-14 bases, from its recognition site.

As used herein, the term “recognition site” can be distinct from the term “cleavage site” for a restriction endonuclease. The endonuclease will generally “recognize” a particular sequence in a polynucleotide molecule—this is the recognition site. The point at which the endonuclease cleaves the polynucleotide molecule (the “cleavage site”) can either be within the recognition site or at some distance away from the recognition site.

As shown in FIG. 4A, the exact cutter is applied to the circular polynucleotide molecule (401) to form a first linear construct (403) in which Adaptor A1 (402) is interposed between a first target sequence T1 (406) and a second target sequence T2 (405). A deletion adaptor (404) is ligated to one end of the first linear construct (403) to form a second linear construct (407). A “deletion adaptor” is an adaptor that comprises a recognition site for a restriction endonuclease and is only used as a tool to delete a number of bases from a polynucleotide molecule (i.e., a deletion adaptor will not generally comprise other functional elements, such as hybridization sites for sequencing primers, etc., and will generally not be used for circularization of polynucleotides or in any other processing or analysis steps). In the embodiment depicted in this figure, the recognition site in the deletion adaptor is also for an exact cutter endonuclease. The exact cutter recognition site in Adaptor A1 (402) may be the same or different from the exact cutter recognition site in deletion adaptor (404).

The exact cutter is applied to the second linear construct (407), resulting in fragment (408) and a third linear construct (410). Since the number of bases “Y” deleted by the exact cutter is known, it is therefore also known that target sequence T1(406) and the now shorter target sequence T2 (409) are separated by Y bases. Taken together, T1 and the shorter T2 form a deletion mate pair, and construct (410) is a deletion mate pair construct. Step (411) further illustrates how T1 and T2 are separated by Y bases.
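The deletion step just described can be modeled, purely as a sketch, by treating the constructs as strings. The example sequences, the placeholder adaptor and the function name are hypothetical; the model assumes the order T1, Adaptor A1, T2 with the deletion adaptor at the free end of T2, and it ignores the small amount of “wobbling” noted above.

```python
# Illustrative string model of constructs (407) -> (410); all names and
# sequences are assumptions, and cutting is treated as perfectly exact.
def apply_exact_cutter(t1, adaptor_a1, t2, cut_distance):
    """Remove cut_distance bases (Y) from the free end of T2.

    Returns (construct, shortened_t2, removed), where construct corresponds
    to the deletion mate pair construct and removed to the Y deleted bases
    that leave with the deletion adaptor.
    """
    if not 0 < cut_distance <= len(t2):
        raise ValueError("cut_distance must be between 1 and len(t2)")
    keep = len(t2) - cut_distance
    shortened_t2 = t2[:keep]
    removed = t2[keep:]
    construct = t1 + adaptor_a1 + shortened_t2
    return construct, shortened_t2, removed


if __name__ == "__main__":
    construct, short_t2, removed = apply_exact_cutter(
        t1="ACGTACGTAC",          # hypothetical T1
        adaptor_a1="GGATCCGG",    # placeholder adaptor sequence
        t2="TTGACCATGCAAGCTA",    # hypothetical T2
        cut_distance=12,          # Y, the known deletion length
    )
    print(construct, short_t2, removed, len(removed))
```

Because the length of the removed stretch equals the known cut distance, T1 and the shortened T2 in the returned construct are known to be separated by Y bases in the molecule from which they are derived.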

Preferably, in a method as illustrated in FIG. 4A, a deletion adaptor (404) is ligated only to the free end of the target sequence which needs to have bases deleted, i.e., to the free end of target sequence T2 (405) in the embodiment illustrated in FIG. 4 c. In such an embodiment, the target sequence that is not meant to have any bases deleted (T1 (406)) will need to be modified to prevent the adaptor from ligating to its free end. In an exemplary embodiment, T1 (406) can be modified by a method in which a recognition site for a nicking endonuclease is included in Adaptor A1 (402). In this embodiment, T1 (406) will have a 3′ overhang, and application of the nicking endonuclease and nick translation by a polymerase will modify that overhanging end to a blunt end. Thus, only the unmodified end will have the overhang that is able to ligate with deletion adaptor (404). Such methods are described in U.S. Ser. No. 60/864,992 filed Nov. 9, 2006; Ser. No. 11/943,703, filed Nov. 2, 2007; Ser. No. 11/943,697, filed Nov. 2, 2007; Ser. No. 11/943,695, filed Nov. 2, 2007; and PCT/US07/835,540; filed Nov. 2, 2007, all of which are incorporated by reference in their entirety to teach this aspect.

In a further embodiment of the method illustrated in FIG. 4A, the process of applying an exact cutter endonuclease is repeated multiple times to generate deletions of a desired length. In a specific preferred embodiment, a deletion of 12 bases is generated in two repeated cycles. In another embodiment, if a 24 base pair deletion is required, a restriction endonuclease that cuts exactly 8 base pairs away from the recognition site can be used 3 times to generate the desired deletion. In one embodiment, the polynucleotide molecules are purified between each of the multiple cycles of deletion using methods known in the art, including without limitation electrophoretic and sedimentation techniques (see, e.g., Sambrook, et al., Molecular Cloning, 2nd Edition, Cold Spring Harbor Laboratory Press, 1989).
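The relationship between the desired deletion length, the cut distance of the exact cutter and the number of cycles can be sketched as follows. The function names and the example sequence are hypothetical, and the sketch assumes that every cycle removes exactly the same number of bases from the same free end.

```python
# Illustrative sketch; names are assumptions and cutting is treated as exact.
def plan_deletion_cycles(total_deletion, cut_distance):
    """Number of cycles needed to delete total_deletion bases."""
    if cut_distance <= 0 or total_deletion <= 0:
        raise ValueError("lengths must be positive")
    cycles, remainder = divmod(total_deletion, cut_distance)
    if remainder:
        raise ValueError("total_deletion is not a multiple of cut_distance")
    return cycles


def apply_deletion_cycles(t2, cut_distance, cycles):
    """Trim cut_distance bases from the free end of t2 once per cycle."""
    for _ in range(cycles):
        if cut_distance > len(t2):
            raise ValueError("target sequence shorter than cut distance")
        t2 = t2[:len(t2) - cut_distance]
    return t2


if __name__ == "__main__":
    cycles = plan_deletion_cycles(24, 8)
    remaining = apply_deletion_cycles("ACGT" * 10, 8, cycles)
    print(cycles, len(remaining))
```

For the 24 base pair example above, the plan is three cycles of an 8 base pair cutter; a 12-base deletion in two cycles corresponds to a cut distance of 6.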

In another embodiment, the multiple cycles of deletion are conducted as a “one pot” reaction in which the polynucleotides are not purified after each deletion reaction. A “one pot” reaction refers generally to the process of conducting multiple reactions on a sample without purifying the products of each reaction before beginning a subsequent reaction. In one such embodiment, reaction buffers can be adjusted to add ligase and endonuclease enzymes in the same reaction vessel with polynucleotide molecules and the adaptors. In such an embodiment, the reaction vessel will first contain polynucleotides and adaptors. The proper buffers for ligase are provided in the vessel, and the ligase is then added to ligate adaptors to polynucleotide molecules. Once the ligation reaction has proceeded for the desired amount of time, the ligase is inhibited using methods known in the art, preferably using increased temperature. The buffers are then adjusted for an endonuclease and the endonuclease is added to the reaction vessel without purification of the ligated molecules. After the endonuclease has cleaved the polynucleotide molecules ligated to adaptors comprising the proper recognition site, the endonuclease can be inhibited and the buffers again adjusted for either a new ligation reaction or for another endonuclease. In one embodiment, after the desired number of cycles of ligation and deletion are complete, the final polynucleotide products can be purified from the reaction mixture using methods known in the art.
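The ordering of operations in such a “one pot” reaction can be written out as a simple schedule. The following sketch is only a bookkeeping aid under the assumption that each cycle consists of one ligation followed by one endonuclease digestion; the function name and step wording are hypothetical, and no reaction conditions are implied.

```python
# Illustrative schedule generator; step wording and names are assumptions.
def one_pot_schedule(cycles):
    """List the steps of a one-pot ligation/deletion protocol in order."""
    if cycles < 1:
        raise ValueError("at least one cycle is required")
    steps = ["combine polynucleotides and adaptors in one reaction vessel"]
    for n in range(1, cycles + 1):
        steps.extend([
            f"cycle {n}: adjust buffer for ligase",
            f"cycle {n}: add ligase and ligate adaptors",
            f"cycle {n}: inhibit ligase (e.g., by increased temperature)",
            f"cycle {n}: adjust buffer for endonuclease",
            f"cycle {n}: add endonuclease and cleave",
            f"cycle {n}: inhibit endonuclease",
        ])
    # Purification occurs only once, after the final cycle.
    steps.append("purify final polynucleotide products")
    return steps


if __name__ == "__main__":
    for step in one_pot_schedule(2):
        print(step)
```

The schedule contains no purification step between cycles, which is the feature that distinguishes the one-pot embodiment from the embodiment in which the molecules are purified after each deletion.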

In specific embodiments of the method illustrated in FIG. 4A, deletion adaptor (404) can be added as two arms, one ligated to each end of the linear construct (403). In this aspect, at least one of the two arms of deletion adaptor (404) comprises a recognition site for a restriction endonuclease that is an “exact cutter”. Use of such constructs is disclosed in U.S. Ser. No. 60/864,992 filed Nov. 9, 2006; Ser. No. 11/943,703, filed Nov. 2, 2007; Ser. No. 11/943,697, filed Nov. 2, 2007; Ser. No. 11/943,695, filed Nov. 2, 2007; and PCT/US07/835,540; filed Nov. 2, 2007, all of which are incorporated by reference in their entirety to teach this aspect.

In a further embodiment, the Adaptor A1 (402) and deletion adaptor (404) comprise recognition sites for the same restriction endonuclease. In such an embodiment, after the first linear construct is created, the recognition site in Adaptor A1 can be blocked using methods known in the art, such as DNA methylation, to prevent the endonuclease from cleaving from the site in Adaptor A1 again. Thus, when deletion adaptor (404) is ligated and the restriction endonuclease is again applied, the only point at which the endonuclease will cleave the polynucleotide molecule will be at the known distance from the recognition site in deletion adaptor (404), because the recognition site in Adaptor A1 (402) has been blocked. Such an embodiment simplifies the process of adding adaptors to the polynucleotide molecule by minimizing the number of different adaptors that need to be designed and the number of restriction endonucleases that need to be used during the process of producing deletion mate pairs.

As is further described herein, the process of creating a deletion mate pair construct may utilize single stranded molecules or double stranded molecules. In some embodiments of the invention, certain steps of the methods for creating a deletion mate pair construct will utilize single stranded molecules, whereas other steps will utilize double stranded molecules.

In one embodiment, deletion mate pair constructs are circularized for further processing and analysis; for example, circular molecules can be used to generate concatemers, as is further described below. Methods of circularizing polynucleotide molecules are described herein. In one exemplary embodiment illustrated in FIG. 4B, the polynucleotide molecule (411) produced by the method illustrated in FIG. 4A is ligated to Adaptor A2 (412) to form construct (413). Construct (413) can then be circularized to form circular polynucleotide (414), in which Adaptor A1 and Adaptor A2 are interposed between target sequence T2 (now without Y bases) (409) and target sequence T1 (406). Circular polynucleotides (414) can then be used in forming concatemers with a rolling circle replication reaction, as described herein. Such a concatemer would have repeating units in which each unit would comprise A1-T1-A3-T2. In addition, additional adapters can be inserted (e.g., in the shortened T2 region) using the restriction binding sites in A1 or A2.

In another exemplary embodiment, which is illustrated in FIG. 4C, a deletion mate pair construct (411), formed by the method discussed above and illustrated in FIG. 4A, is ligated to Adaptor A2 (412) and Adaptor A3 (413) to form construct (416). In a preferred embodiment, Adaptor A2 and Adaptor A3 comprise sites which can be utilized in sequencing reactions. For example, Adaptor A2 and A3 may comprise hybridization sites (e.g., for sequencing probes, anchor probes, primers and the like) and those hybridization sites can be utilized in the analysis of the nucleotide sequence of target sequence T1 (406) and at least a part of shortened target sequence T2 (409).

In one embodiment, the invention provides a method for creating a circularized deletion mate pair construct as illustrated in FIG. 5. In this method, a circular polynucleotide molecule (501) is provided which comprises a region X contiguous with Adaptor A, which is in turn contiguous with region Y, which is in turn contiguous with region Z. Regions X, Y and Z are contiguous within the polynucleotide from which they are derived. Adaptor A comprises two recognition sites for endonuclease enzymes. One restriction site is a recognition site for a Type IIs endonuclease (black oval), and one restriction site is a recognition site for an endonuclease that is not a Type IIs endonuclease (white oval). Although the Type IIs site is depicted as being to the “right” of the non-Type IIs endonuclease, the method is not limited to this configuration and the sites can be switched in position. Thus, in certain aspects of the invention the IIs restriction site is retained in the construct, and may be used again in subsequent reactions.

As shown in (502), the Type IIs restriction endonuclease is applied, cleaving the molecule at a point between region Y and region Z. In a preferred embodiment, the Type IIs restriction endonuclease is an exact cutter, and thus the cleavage site is at a known distance from the recognition site.

In step (503), the non-Type IIs endonuclease is applied, thus fully cleaving the Type IIs restriction site and region Y (504) out from the remainder of the polynucleotide molecule. In (505), region Z is then ligated to the remainder of Adaptor A, thus re-circularizing the molecule. This ligation may be accomplished by any method known in the art and discussed herein. Since the fragment (504) is of known length, regions X and Z are deletion mate pairs separated by the number of bases in region Y.

FIG. 6 illustrates a further embodiment in which multiple deletion cycles are applied to form the circular deletion mate pair construct. Again, a circular polynucleotide molecule (601) is provided which comprises a region X contiguous with Adaptor A, which is in turn contiguous with Y, which is contiguous with Z, which is contiguous with W. Regions X, Y, Z and W are contiguous within the polynucleotide from which they are derived. Adaptor A comprises two recognition sites—one for an exact cutter Type IIs restriction endonuclease (black oval), and one for a non-Type IIs restriction endonuclease (white oval). In this embodiment, the exact cutter is applied in (602) to cleave at a point between Y and Z. The non-Type IIs restriction endonuclease is then applied in (604), fully cleaving region Y (603) from the remainder of the polynucleotide molecule. In (605), Adaptor A is ligated to region Z to re-circularize the molecule. In (606), the exact cutter is again applied to cleave at a point between Z and W. In (608), the non-Type IIs restriction endonuclease is applied, allowing complete removal of Z (607). In (609), W is ligated to Adaptor A, again re-circularizing the molecule. Since both applications of the Type IIs restriction endonuclease deleted a known number of bases, X and W in (609) are deletion mate pairs separated by the number of bases in regions Y plus Z. The stepwise deletion method described in FIG. 6 can be used to generate a library of deletion mate pair constructs from one or more target polynucleotides, in which the library comprises constructs that have undergone different numbers of deletion cycles. When the deletion mate pairs from such a library are sequenced, the sequence reads can be efficiently assembled using not only overlapping sequences but also the different deletion regions.
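The bookkeeping behind such stepwise deletion can be sketched with a simple string model. This is an illustration under stated assumptions, not a description of the enzymatic steps: the circle is written linearly as region X, then the adaptor, then the remaining target sequence; the function `delete_cycles` and the parameter `bases_per_cycle` are hypothetical; and each cycle is assumed to remove the same known number of bases immediately adjacent to the adaptor.

```python
from typing import List, Tuple


def delete_cycles(x: str, adaptor: str, downstream: str,
                  bases_per_cycle: int, cycles: int) -> Tuple[str, List[str], int]:
    """Model repeated exact deletions next to an adaptor.

    Returns the final construct, the list of removed fragments, and the
    total number of deleted bases (the known separation of the mate pair).
    """
    removed: List[str] = []
    for _ in range(cycles):
        if len(downstream) < bases_per_cycle:
            raise ValueError("not enough sequence left to delete")
        removed.append(downstream[:bases_per_cycle])  # fragment cut out (Y, then Z)
        downstream = downstream[bases_per_cycle:]     # adaptor re-ligated to the rest
    separation = sum(len(fragment) for fragment in removed)
    return x + adaptor + downstream, removed, separation


if __name__ == "__main__":
    construct, removed, separation = delete_cycles(
        "XXXX", "[A]", "YYYYYY" + "ZZZZZZ" + "WWWW", bases_per_cycle=6, cycles=2)
    print(construct, removed, separation)
```

In this toy example, two cycles of six bases leave X and W adjacent to the adaptor, and the recorded separation of 12 bases corresponds to regions Y plus Z.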

FIG. 7 illustrates an embodiment of the invention in which multiple deletion cycles are performed on a linear molecule. Construct (701) comprises regions Z, W and V on one end and a region X and Adaptor A1 on the other end. Fragments for use in multiple deletion cycles are preferably prepared by partial restriction digestion using one or more enzymes with frequent recognition sites. Regions Z, W and V are contiguous within the polynucleotide molecule from which they are derived. In (703), deletion adaptor (702) is ligated to one end of construct (701). Deletion adaptor (702) comprises a recognition site for an exact cutter endonuclease. The exact cutter is applied to cleave between region Z and W, resulting in fragment (704) and the shortened construct (705), which now has had region Z removed. In (707), deletion adaptor (706) is ligated to the construct (705). The deletion adaptor (706) also comprises a recognition site for an exact cutter, and this recognition site may be the same as or different from the recognition site in deletion adaptor (702). The exact cutter is applied to form fragment (708) and the further shortened construct (709). Regions V and X in (709) are deletion mate pairs. As discussed above for the method of FIG. 6, the method illustrated in FIG. 7 can be used to generate a library of constructs that have undergone different numbers of deletion cycles.

FIG. 8 illustrates another embodiment of forming deletion mate pair constructs that have undergone multiple deletion cycles. In this embodiment, a linear polynucleotide molecule (801) comprises regions Z, W and V on one end and region X on the other end. In a preferred embodiment, molecule (801) is a fragment of a polynucleotide formed according to methods known in the art and described herein. Molecule (801) is ligated to Adaptor A1 in (802). Preferably, Adaptor A1 is only ligated to one end of molecule (801). In a preferred embodiment, multiple copies of molecule (801) undergo the ligation reaction in (802). In such an embodiment, an aliquot of the ligated molecule in (802) can be circularized using methods described herein to form circularized molecule (803). Another aliquot is ligated to a deletion adaptor (DA) in (804). The deletion adaptor comprises a recognition site for an exact cutter (black oval). The exact cutter is applied to form fragment (805) and shortened construct (806). Again, an aliquot of (806) can be circularized to form (807). Another aliquot of (806) can be ligated to a deletion adaptor to form the construct in (808). The deletion adaptor in (808) also comprises a recognition site for an exact cutter. The recognition site in (808) can be the same as or different from the recognition site in (804). The exact cutter is applied to form fragment (809) and further shortened construct (810). Construct (810) can in turn be circularized to form construct (811). Constructs (806), (807), (810) and (811) all comprise deletion mate pairs. Libraries of these constructs can be used in sequencing reactions as described herein, and the resultant sequence reads can be more efficiently aligned than is possible with traditional sequencing reactions, because the alignments can be based on not only overlapping sequences but also on the various deleted regions. Libraries formed from constructs made according to the methods illustrated in FIG. 8 may include only the linear constructs or only the circularized constructs, or they can comprise a mixture of both linear and circular constructs. Additional adaptors can be inserted, preferably in the region ZWV and adjacent regions, using restriction binding sites on one or both sides of A1.

FIG. 9 illustrates a further embodiment of the methods illustrated in FIGS. 7 and 8, in which a second adaptor is ligated to the ends of deletion mate pair constructs that have been formed using different numbers of deletion cycles. Both adaptors in these constructs are preferably used in sequencing reactions described herein—having two adaptors in the molecule increases the number of bases that can be read in a single sequencing reaction, by providing two (or more) different points of origin for such sequencing reactions. Sequencing reactions, particularly sequencing reactions utilizing adaptors incorporated into polynucleotide molecules, are described further herein.

FIG. 10 illustrates another embodiment in which a deletion cycle is performed. In this embodiment, circularized construct (1001) comprises Adaptor A1 interposed between region X and region Y. Regions Y, Z and W are contiguous within the polynucleotide from which they are derived. Adaptor A1 comprises a recognition site for an exact cutter (black oval). The exact cutter is applied in (1002) to cleave between region Y and region Z to form linear construct (1003). A deletion adaptor (DA) is ligated to form construct (1004). The deletion adaptor comprises a recognition site for an exact cutter (black oval). The recognition site in construct (1004) may be the same as or different from the recognition site in Adaptor A1 of construct (1002). The exact cutter is applied to form fragment (1005) and shortened construct (1006). Construct (1006) can then be ligated to Adaptor A2 to form construct (1007), or alternatively construct (1006) can be circularized to form construct (1008). Both constructs (1007) and (1008) can be used in sequencing reactions to identify the sequences of regions W, X and Y in (1007) and the sequences of regions X and Y in construct (1008), e.g., a sequencing method can be used to read Y and at least a part of W using A1 in (1008).

FIG. 11 illustrates another method for forming deletion mate pair constructs using combinations of deletion cycles. In this embodiment, construct (1101) comprises Adaptor A interposed between region X and region Y. Regions W, X, Y, and Z are contiguous within the polynucleotide from which they are derived. Adaptor A comprises four recognition sites for restriction endonucleases (ovals). The dark ovals represent recognition sites for exact cutters, whereas the white ovals represent recognition sites for non-Type IIs restriction endonucleases. In (1102), the exact cutter for the recognition site represented by the black oval is applied. The endonuclease for the recognition site next to the site represented by the black oval is then applied to completely cleave region Y (1104) from the remainder of the construct (1103). Construct (1103) is then circularized to form construct (1105). The exact cutter for the recognition site represented by the gray oval is applied to construct (1105) to form construct (1106). The restriction endonuclease for the recognition site next to the recognition site represented by the gray oval is in turn applied to construct (1106) to completely cleave region X (1108) from the remainder of the construct (1107). Construct (1107) is then circularized to form construct (1109). Because the deleted regions were of known length (because the exact cutters cleave at a known number of bases away from their recognition sites), regions W and Z in construct (1109) are deletion mate pairs.

FIG. 12 illustrates a version of the method described above for FIG. 11, except that the method illustrated in FIG. 12 begins with linear polynucleotide molecule (1201). Molecule (1201) is preferably a fragment of a target polynucleotide formed according to methods described herein. Each end of molecule (1201) is ligated to a deletion adaptor (DA). The deletion adaptors comprise recognition sites for exact cutters (ovals), and the recognition sites in the two deletion adaptors may be the same as or different from each other. The exact cutters for both deletion adaptors are applied to form fragments (1203) and (1204) and shortened construct (1205). Construct (1205) is then ligated to Adaptor A to form construct (1207). Construct (1207) can then be circularized to form construct (1208), in which adaptor A is interposed between the deletion mate pair comprising region X and region V.

In a very specific aspect of the invention (not shown), circular nucleic acid constructs without an adaptor can be used to identify a number of bases on either side of a specific deletion. This aspect can be performed by: providing a circularized fragment of a target polynucleotide, in which the two ends of the fragment, which are not contiguous in the target polynucleotide, are joined in the circular construct; cleaving the circle at a site substantially distant from the site of the joined fragment ends; deleting a specified number of bases at the cleavage site; and identifying a selected number of bases at each end of the deletion site. Identification of these bases generally involves one or more adapters having sequencing reaction binding sites (e.g., for primers, anchors or probes) ligated to such fragment ends.

FIG. 13 illustrates yet another embodiment of the invention. In this embodiment, construct (1301) comprises Adaptor A interposed between regions X and Z. X and Z are deletion mate pairs formed by any of the methods described above. Adaptor A comprises two recognition sites for exact cutters (black ovals). The recognition sites may be the same or different. The exact cutter(s) for both recognition sites are applied to construct (1301) in (1302) to form molecule (1303) and construct (1304). Construct (1304) consists of regions X and Z and Adaptor A. Construct (1304) is then circularized to form construct (1305), in which regions X and Z are now contiguous within the circular molecule. In this embodiment, Adaptor A is longer than adaptors generally used in methods of the invention, so that construct (1304) is of sufficient length to form the circular construct (1305). Preferably, Adaptor A in this embodiment is at least about 20-40 bases in length. In another embodiment, Adaptor A is at least 50 bases in length, at least 100 bases in length, at least 150 bases in length, at least 200 bases in length, at least 500 bases in length, and the like.

In any of the methods discussed above involving multiple deletion cycles, the number of deletion cycles is not limited to the numbers illustrated in the exemplary figures. The steps of ligating a deletion adaptor to a construct and cleaving with an exact cutter can be repeated multiple times to form constructs with increasing numbers of bases deleted.

In general, the polynucleotide molecules used in generating deletion mate pair constructs are at least partially double stranded throughout the process. Once the final linear or circular construct containing the deletion mate pair is formed, the double stranded molecule can be separated into single stranded molecules using methods known in the art (see, e.g., Erlich, ed., PCR Technology, Principles and Applications for DNA Amplification, (W.H. Freeman and Co, New York, 1992), Chapter 7). For example, the double stranded molecules can be denatured using heat or high pH to “melt” the strands and cause them to separate. Double stranded polynucleotide molecules can also be denatured by using a denaturing polyacrylamide gel and isolating the single stranded molecules using methods known in the art.

Circularization of double-stranded polynucleotides generally requires a polynucleotide longer than about 150 bases and preferably longer than 300 bases. If the target polynucleotide is short, a longer adaptor may be used to facilitate the circularization of the construct. Because single-stranded polynucleotides are more flexible, much shorter single-stranded circles can be formed (e.g., from fragments as short as 50-150 bases). In specific aspects, constructs comprising longer strands of double-stranded polynucleotides can be used for adaptor insertion, and a portion of this double-stranded polynucleotide can be removed prior to the formation of single-stranded circular constructs. This can allow a large number of targets to be inserted into such constructs.

In another embodiment, one or more steps of the methods for forming deletion mate pair constructs utilize single stranded molecules. In such embodiments, the single stranded molecule can be rendered partially double stranded in order to ligate adaptors to the polynucleotide molecule or to create a recognition site for a restriction endonuclease. A single stranded molecule can be rendered partially double stranded by using oligonucleotides of 10-30 bases in length, which have sequences complementary to part of the sequence of the single stranded molecule. The oligonucleotides will hybridize to the corresponding sequence, thus creating a “localized” double stranded region on the otherwise single stranded molecule. In some embodiments, a single stranded linear construct can be circularized using CircLigase™ as is further discussed herein.
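As a minimal sketch of how such a "localized" double stranded region can be located in silico, the following example finds where an oligonucleotide would pair with a single stranded template. The function names are hypothetical, only exact Watson-Crick complementarity over A, C, G and T is modeled, and hybridization conditions such as temperature and salt are ignored.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def duplex_region(template: str, oligo: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span on the template paired with the oligo.

    Coordinates are 0-based and half-open; None means no exactly
    complementary site was found. Only the first site is reported.
    """
    site = reverse_complement(oligo)
    start = template.find(site)
    if start == -1:
        return None
    return start, start + len(site)


if __name__ == "__main__":
    template = "TTTTACGGATCCAAGG"
    oligo = reverse_complement("ACGGATCCAA")  # a 10-base complementary oligo
    print(duplex_region(template, oligo))
```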

As described herein, adaptors may include many functional elements, such as recognition sites for restriction endonucleases, sites for primers, and anchor probe hybridization sites. Adaptors utilized in the methods of creating a deletion mate pair construct may also be referred to herein by their roles within such methods. For example, an adaptor used to circularize a polynucleotide can be referred to as a “circularization adaptor”. Similarly, an adaptor used to delete a number of bases from a construct can be referred to as a “deletion adaptor”. These adaptors are not limited to only a single structure or function. For example, a circularization adaptor may also serve as a deletion adaptor if it comprises a restriction endonuclease recognition site. Similarly, a deletion adaptor may also be used to circularize a polynucleotide molecule.

In another aspect, a library of linear or circular deletion mate pair constructs can be created from a sample, such as a genomic sample, or isolated from target polynucleotides or fragments of polynucleotides. In one embodiment, the library is enriched for polynucleotide molecules of a specific length by first separating DNA fragments using methods known in the art, including without limitation polyacrylamide gel purification. In another embodiment, a library comprising two or more different kinds of mate pairs is generated. For example, a first library can be created using only a single application of a deletion step, a second library using two applications of the deletion step, and so forth. Alternatively, libraries can be generated by using different numbers of deletion cycles, by using different combinations of endonucleases, including without limitation different exact cutters, different non-exact cutters, and combinations thereof. Libraries can also be generated by using different combinations of numbers of deletion cycles, different endonucleases, and other methods of generating deletion mate pair constructs of different lengths, sequences and structures. Such libraries can be analyzed separately or tagged and combined as a mixture into a single library. When analyzed as a mixture, the analysis can include detection (such as sequencing) of the tags. Such tags can include without limitation detectable labels, such as fluorescent labels, which can identify constructs based on properties that include without limitation length, type of deletion, numbers of deletion cycles, and the like.
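When tagged libraries are combined and analyzed as a mixture, the detected tag can be used to sort the constructs back into their originating libraries. The following is a minimal sketch of that sorting step only; the function `group_by_tag`, the tag strings and the toy sequences are hypothetical, and each pooled item is assumed to arrive as a (tag, construct) pair.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_tag(pooled: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group pooled constructs by their detected tag.

    Each input item is a (tag, construct) pair; the tag might encode, for
    example, the number of deletion cycles the construct has undergone.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for tag, construct in pooled:
        grouped[tag].append(construct)
    return dict(grouped)


if __name__ == "__main__":
    pooled = [
        ("1 cycle", "XXAAZZWW"),
        ("2 cycles", "XXAAWW"),
        ("1 cycle", "GGAATTCC"),
    ]
    for tag, constructs in group_by_tag(pooled).items():
        print(tag, constructs)
```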

In one aspect, deletion mate pair constructs made according to the invention comprise target sequences. A plurality of deletion mate pair constructs, such as a library of deletion mate pair constructs, can include enough different target sequences to cover (i.e., represent) part or all of a source nucleic acid, including without limitation a target polynucleotide, a genome, a cDNA library, and the like. Such a plurality of deletion mate pair constructs may cover about 5% to about 100% of a source nucleic acid, about 1% to about 95%, about 5% to about 90%, about 10% to about 85%, about 15% to about 80%, about 20% to about 75%, about 25% to about 70%, about 30% to about 65%, about 35% to about 60%, about 40% to about 55%, and about 45% to about 50% of the source nucleic acid. In a preferred embodiment, a plurality of deletion mate pair constructs comprises target sequences which together represent about 80% of a source nucleic acid. In a particularly preferred embodiment, a plurality of deletion mate pair constructs comprises target sequences which together represent about 80% of a genome.
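One way to estimate the fraction of a source nucleic acid represented by a plurality of target sequences is to merge the positions they occupy and divide by the source length. The sketch below assumes, for illustration only, that each target sequence has already been mapped to a 0-based, half-open interval on the source; the function name `coverage_fraction` is hypothetical.

```python
from typing import Iterable, Tuple


def coverage_fraction(intervals: Iterable[Tuple[int, int]], source_length: int) -> float:
    """Return the fraction of the source covered by the union of intervals.

    Intervals are (start, end) pairs, 0-based and half-open, with start >= 0.
    Overlapping intervals are counted only once.
    """
    if source_length <= 0:
        raise ValueError("source_length must be positive")
    covered = 0
    covered_up_to = 0  # right edge of the region already counted
    for start, end in sorted(intervals):
        start = max(start, covered_up_to)
        end = min(end, source_length)
        if end > start:
            covered += end - start
            covered_up_to = end
    return covered / source_length


if __name__ == "__main__":
    # Two overlapping target sequences on a 100-base source.
    print(coverage_fraction([(0, 50), (40, 80)], 100))
```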

Deletion mate pair constructs made according to the invention can be further processed and analyzed as is described in detail below.

Generating Nested Fragments with Exact End-Deletions

Multiple target polynucleotide fragments with exact end-deletions relative to the entire polynucleotide can be used for determining the length of simple repeats, for example 10-30 TC repeats or a poly-A repeat. Long and highly overlapped nucleic acid fragments created from a target polynucleotide can be prepared by partial digestion with one or a pool of frequently-cutting restriction enzymes as described. Preferably, the use of one or multiple restriction enzymes cleaves the target polynucleotide at approximately every 50-200 bases, resulting in fragments of about 1 kb to 10 kb in length that each begin at predefined restriction enzyme recognition sequence sites. A pooled group of these fragments can then be used in the creation of constructs for use in sequence determination of the target polynucleotide.

The initial target polynucleotide fragments created are sequentially deleted using consecutive cycles of ligation of an adapter having a Type IIs restriction enzyme binding site to the ends of the fragments, and cleavage of the ends of the fragments to create a deletion of known length from each end. Such ligation and deletion reactions can occur in multiple reactions or, preferably, in a single-tube reaction. Sequential deletion of a defined number of bases from the ends of the fragments generates “nested” fragments from the target polynucleotide fragments.

Using nested fragments in sequencing reactions allows determination of sequences that are separated by an exact number of bases in the target polynucleotide. This facilitates determination of an exact length of tandem repeats such as mono, double and triple repeats that are located between these fragment ends. A schematic example using a target polynucleotide fragment and a nested fragment known to be exactly 60 bases shorter in length than the fragment is depicted below, with each “ . . . ” representing a deletion of known length:

CATGBBBBBBBBAAAAAAAAAAABBB . . . BBBBBBBBBB . . . BBBCATGBBBBB . . . (SEQ ID NO: 3) Target Fragment
                                     BBBBBB . . . BBBCATGBBBBB . . . (SEQ ID NO: 4) Nested Fragment
|------------ 60 bases -------------|
|-12 bases-||--polyA--||--37 bases--|

When these fragments are used in sequence assembly of the entire target polynucleotide, the total number of bases surrounding the polyA repeat that are present in the target fragment and not present in the nested fragment (here, 12+37=49 bases) is used to determine the length of the polyA repeat. From comparison of the target fragment and the nested fragment, the total number of deleted bases is known to be 60. Thus, the polyA repeat length can be determined to be 60−49=11 bases.
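The calculation in this example reduces to a single subtraction, which can be sketched as follows. The function name `repeat_length` and its parameter names are hypothetical; the sketch assumes the total deletion length and the two flanking lengths are each known exactly.

```python
def repeat_length(total_deleted: int, flank_before: int, flank_after: int) -> int:
    """Infer a repeat length from an exact deletion of known size.

    total_deleted: bases present in the target fragment but absent from
        the nested fragment.
    flank_before, flank_after: sequenced, non-repeat bases on either side
        of the repeat that lie within the deleted stretch.
    """
    flanking = flank_before + flank_after
    if flanking > total_deleted:
        raise ValueError("flanking bases exceed the total deleted length")
    return total_deleted - flanking


if __name__ == "__main__":
    # Values from the schematic example: 60 deleted bases, 12 + 37 flanking.
    print(repeat_length(60, 12, 37))  # 11
```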

Libraries created using the nested fragment methods allow “linked” reads to be obtained for sequence assembly, and allow the use of predefined mapping sites and reference sequences for each restriction enzyme used in the methods. For example, initial mapping of a fragment to the sequence of a target polynucleotide may use a predefined number of mapping sites at a predefined distance from each restriction site. Use of nested fragments of the invention has the potential of reducing mapping computation of larger polynucleotides (such as an entire genome) 10-100 fold, and may eliminate the need for fragment size selection in the preparation of mate pairs. Moreover, a combination of deletion mate-pairs and conventional mate-pairs can be used to determine evidence of sequence location, presence of mutations, and the like.

Another advantage of using nested fragments generated by restriction enzymes (as opposed to random fragmentation) is that use of such specific fragments in the creation of the nested fragments can provide a more constrained data set for comparative analysis across individuals.

Amplicons

In one aspect of the invention, polynucleotides of the invention are used to generate amplicons. The term “amplicon” means the product of a polynucleotide amplification reaction. That is, it is a population of polynucleotides that are replicated from one or more starting polynucleotides, e.g., either the linear or the circular constructs of the present invention. Amplicons may be produced by a variety of amplification reactions, including but not limited to polymerase chain reactions (PCRs), linear polymerase reactions, nucleic acid sequence-based amplification, circle dependent amplification and like reactions (see, e.g., U.S. Pat. Nos. 4,683,195; 4,965,188; 4,683,202; 4,800,159; 5,210,015; 6,174,670; 5,399,491; 6,287,824 and 5,854,033; and US Pub. No. 2006/0024711).

In one aspect, the invention provides concatemers generated from polynucleotide molecules. Such concatemers contain multiple copies of a target polynucleotide or a fragment of a target polynucleotide. DNA concatemers under conventional conditions (a conventional DNA buffer, e.g., TE, SSC, SSPE, or the like, at room temperature) form random coils that roughly fill a spherical volume in solution having a diameter of from about 100 to 300 nm, which depends on the size of the DNA and buffer conditions, in a manner well known in the art, e.g., Drmanac et al., U.S. patent application Ser. No. 11/451,691; Drmanac et al., U.S. patent application Ser. No. 11/451,692; Edvinsson, “On the size and shape of polymers and polymer complexes,” Dissertation 696 (University of Uppsala, 2002).

Concatemers, particularly concatemers with a secondary structure such as a random coil, are also referred to herein as “DNA nanoballs” (“DNBs”).

As discussed herein, target polynucleotides may be generated from a source nucleic acid, such as genomic DNA, cDNA (including cDNA libraries), cRNA (including cRNA libraries), siRNA (and siRNA libraries) and mRNA (as well as products of transcription and reverse transcription).

Although many of the following descriptions focus on DNA molecules, the invention is not limited to DNA polynucleotide molecules, and the following methods apply to other types of polynucleotides, including without limitation mRNA, siRNA, and cRNA.

In many cases, enzymatic digestion of the source nucleic acid, particularly genomic DNA, is not required because shear forces created during lysis and extraction will generate fragments in the desired range. In another embodiment, shorter fragments (1-5 kb) can be generated by enzymatic fragmentation using restriction endonucleases. In one embodiment, 10-100 genome-equivalents of DNA ensure that the population of fragments covers the entire genome. In some cases, it is advantageous to provide carrier DNA, e.g., unrelated circular synthetic double-stranded DNA, to be mixed and used with the sample DNA whenever only small amounts of sample DNA are available and there is danger of losses through nonspecific binding, e.g., to container walls and the like. In one embodiment, the DNA is denatured after fragmentation to produce single stranded fragments.

In addition to target polynucleotides or portions of target polynucleotides, concatemers of the invention in a preferred embodiment also include interspersed adaptors that permit acquisition of sequence information from multiple sites, either consecutively or simultaneously. In this embodiment, interspersed adaptors comprise hybridization sites for sequencing probes, allowing for detection and identification of nucleotides in adjacent detection positions at numerous points along the target polynucleotide molecule. Since interspersed adaptors are interspersed throughout the polynucleotide molecule, a long target polynucleotide can be sequenced using short sequence reads, because the sequencing reactions have multiple “starting points” in the multiple interspersed adaptors.

In a preferred aspect, rolling circle replication (RCR) is used to create concatemers of the invention. The RCR process has been shown to generate multiple continuous copies of the M13 genome (Blanco, et al., (1989) J Biol Chem 264:8935-8940). In this system, as illustrated in FIGS. 2 and 3, the desired polynucleotide fragment is replicated by linear concatemerization. Guidance for selecting conditions and reagents for RCR reactions is available in many references available to those of ordinary skill, as evidenced by the following, which are each incorporated by reference: Kool, U.S. Pat. No. 5,426,180; Lizardi, U.S. Pat. Nos. 5,854,033 and 6,143,495; Landegren, U.S. Pat. No. 5,871,921; and the like.

Generally, RCR reaction components include single stranded DNA circles, one or more primers that anneal to the DNA circles, a DNA polymerase having strand displacement activity to extend the 3′ ends of primers annealed to the DNA circles, nucleoside triphosphates, and a conventional polymerase reaction buffer. Such components are combined under conditions that permit the primers to anneal to the DNA circles. Extension of these primers by the DNA polymerase forms concatemers of DNA circle complements.
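By way of illustration only, the product of an RCR reaction can be modeled in silico as repeated copies of the complement of the DNA circle, beginning at the primer. The following Python sketch is a simplified model under stated assumptions (an error-free polymerase, a single primer binding site, and a fixed number of copies); the function names and the example sequences are hypothetical and do not appear elsewhere in this disclosure.

```python
# Illustrative model only: RCR product as tandem copies of the circle complement.
# Sequences are written 5'->3'; all names here are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def rcr_concatemer(circle: str, primer: str, n_copies: int) -> str:
    """Return the 5'->3' concatemer formed by extending `primer` around `circle`.

    `circle` is the single stranded circle linearized at an arbitrary point.
    The product starts with the primer and consists of `n_copies` tandem
    copies of the complement of the circle.
    """
    if len(primer) > len(circle):
        raise ValueError("primer is longer than the circle")
    site = reverse_complement(primer)   # region of the circle bound by the primer
    doubled = circle + circle           # allows the binding site to span the join
    start = doubled.find(site)
    if start == -1:
        raise ValueError("primer does not anneal to the circle")
    # Rotate the circle so that it ends with the primer binding site; its
    # reverse complement then begins with the primer itself.
    end_of_site = start + len(site)
    rotated = doubled[end_of_site:end_of_site + len(circle)]
    return reverse_complement(rotated) * n_copies


if __name__ == "__main__":
    print(rcr_concatemer("ATGCCGTA", "CGG", 3))
```

In this model the concatemer length is simply the circle length multiplied by the number of copies; in practice the number of copies varies from molecule to molecule.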

Preferably, concatemers produced by RCR are approximately uniform in size; accordingly, in some embodiments, methods of making arrays of the invention may include a step of size-selecting concatemers. For example, in one aspect, concatemers are selected that as a population have a coefficient of variation in molecular weight of less than about 30%; and in another embodiment, less than about 20%. In one aspect, size uniformity is further improved by adding low concentrations of chain terminators, such as ddNTPs, to the RCR reaction mixture to reduce the presence of very large concatemers, e.g., produced by DNA circles that are synthesized at a higher rate by polymerases. In one embodiment, concentrations of ddNTPs are used that result in an expected concatemer size in the range of from 50-250 Kb, or in the range of from 50-100 Kb. In another aspect, concatemers may be enriched for a particular size range using conventional separation techniques, e.g., size-exclusion chromatography, membrane filtration, or the like.
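As a non-limiting illustration of the size-uniformity criterion, the coefficient of variation of a population of concatemers may be computed as the standard deviation of their sizes divided by the mean size. The Python sketch below assumes that sizes are supplied in kilobases and that molecular weight is proportional to length, so that the coefficient of variation is the same for either measure; the function names and the sample values are hypothetical.

```python
# Illustrative sketch: coefficient of variation (CV) of concatemer sizes.
from statistics import mean, pstdev


def coefficient_of_variation(sizes_kb):
    """Return the population standard deviation divided by the mean (a fraction)."""
    average = mean(sizes_kb)
    if average <= 0:
        raise ValueError("sizes must have a positive mean")
    return pstdev(sizes_kb) / average


def is_size_uniform(sizes_kb, max_cv=0.30):
    """Return True if the population CV is below `max_cv` (e.g., 0.30 or 0.20)."""
    return coefficient_of_variation(sizes_kb) < max_cv


if __name__ == "__main__":
    narrow = [90, 100, 110]   # hypothetical sizes in kb
    broad = [50, 100, 250]    # hypothetical sizes in kb
    print(is_size_uniform(narrow, max_cv=0.20), is_size_uniform(broad))
```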

The RCR process relies upon the desired target molecule first being formed into a circular substrate. This linear amplification uses the original DNA molecule, not copies of a copy, thus ensuring fidelity of sequence. As a circular entity, the molecule acts as an endless template for a strand displacing polymerase that extends a primer complementary to a portion of the circle. The continuous strand extension creates long, single-stranded DNA consisting of hundreds of concatemers comprising multiple copies of sequences complementary to the circle.

FIG. 14 illustrates one aspect of the embodiments for creating concatemers for use in the invention. In this embodiment, source nucleic acid (1400) is treated (1401) to form single stranded fragments (1402), preferably in the range of from 50 to 600 nucleotides, and more preferably in the range of from 300 to 600 nucleotides. Individual fragments (1406) of source nucleic acid are then ligated (1403) to adaptors (1404) to form a population of adaptor-fragment conjugates (1405). Source nucleic acid (1400) may be genomic DNA extracted from a sample using conventional techniques, or a cDNA or genomic library produced by conventional techniques, or synthetic DNA, or the like. Treatment (1401) usually entails fragmentation by a conventional technique, such as chemical fragmentation, enzymatic fragmentation, or mechanical fragmentation, followed by denaturation to produce single stranded DNA fragments.

Adaptors (1404), in this example, are used to form (1408) a population (1410) of DNA circles by the method illustrated in FIG. 14. In one aspect, each member of population (1410) has an adaptor with an identical primer binding site and a DNA fragment (1406) from source nucleic acid (1400). As discussed above, the adaptor also may have other functional elements including, but not limited to, tagging sequences, attachment sequences, palindromic sequences, restriction sites, functionalization sequences, and the like. In other embodiments, classes of DNA circles may be created by providing adaptors having different primer binding sites.

After DNA circles (1410) are formed, a primer and rolling circle replication (RCR) reagents may be added to generate (1411) in a conventional RCR reaction (1412) concatemers (1413) of the complements of the adaptor oligonucleotide and DNA fragments, which population can then be isolated using conventional separation techniques. Performing this for multiple circles (1414) results in a population of concatemers (1415) for construction of arrays of the invention.

In a specific aspect, primers used for RCR may be selected to match target sequences within the DNA fragments rather than in the adaptor. In such an embodiment, concatemers are produced preferentially from those DNA circles that include these target sequences.

Alternatively, amplification of the circular nucleic acids may be implemented by successive ligation of short oligonucleotides, e.g., 6-mers, from a mixture containing all possible sequences, or if circles are synthetic, a limited mixture of these short oligonucleotides having selected sequences for circle replication, a process known as “circle dependent amplification” (CDA). “Circle dependent amplification” or “CDA” refers to multiple displacement amplification of a double-stranded circular template using primers annealing to both strands of the circular template to generate products representing both strands of the template, resulting in a cascade of multiple-hybridization, primer-extension and strand-displacement events. This leads to an exponential increase in the number of primer binding sites, with a consequent exponential increase in the amount of product generated over time. The primers used may be of a random sequence (e.g., random hexamers) or may have a specific sequence to select for amplification of a desired product. CDA results in a set of concatemeric double-stranded fragments.

Concatemers may also be generated by ligation of target DNA in the presence of a bridging template DNA complementary to both the beginning and the end of the target molecule. A population of different target DNAs may be converted into concatemers by a mixture of corresponding bridging templates.

In a preferred embodiment, a subset of a population of DNA circles may be isolated based on a particular feature, such as a desired number or type of adaptor. This population can be isolated or otherwise processed (e.g., size selected) using conventional techniques, e.g., a conventional spin column, or the like, to form a population from which a population of concatemers can be created using techniques such as RCR.

As illustrated in FIG. 15, in certain embodiments, DNA circles prepared from source nucleic acid (1500) need not include an adaptor oligonucleotide. As before, source nucleic acid (1500) is fragmented and denatured (1502) to form a population of single-stranded fragments (1504), preferably in the size range of from about 50 to 600 nucleotides, and more preferably in the size range of from about 300 to 600 nucleotides, after which they are circularized in a non-template driven reaction with circularizing ligase, such as CircLigase (Epicentre Biotechnologies, Madison, Wis.), or the like. After formation of DNA circles (1506), concatemers are generated by providing a mixture of primers that bind to selected sequences. The mixture of primers may be selected so that only a subset of the total number of DNA circles (1506) generates concatemers. For example, primers can be selected to target certain exon sequences, thus enriching the population of concatemers for these exon sequences. Primers used in this aspect may, as described herein, include a tail sequence. In one embodiment, the primers all share an identical tail sequence (also referred to herein as a “tail oligonucleotide”). In another embodiment, a group of tailed primers will include multiple different tail sequences. Generating concatemers for multiple circles results in a population of concatemers, and the desired concatemers are isolated (1510), resulting in a population of concatemers (1512).

In one aspect, once concatemers are immobilized to a surface, the primers can be extended using a non-strand displacing polymerase to form sets of copies of the concatemers that are individually attached to the surface, and the concatemer template can be removed to obtain single stranded DNA using methods known in the art. For example, removal can comprise, without limitation, methods including: nicking the concatemer, cutting the concatemer using ssDNA nuclease or other enzyme at the gaps between two units of the concatemer, or selective degradation of a single-stranded template. For example, if uracils are used in preparation of the concatemer, these uracils can be degraded to form the single stranded DNA. Any of these methods of removing the concatemer can be combined with DNA digestion by a 5′ exonuclease, with denaturing agents, or with some combination thereof. Removing the concatemer after creating multiple copies of complementary sequences is particularly useful in aspects of the invention for target sequence analyses and other assays where multiple individually attached copies of the target polynucleotide are desirable.

After concatemers are generated, e.g., using the above-described methods, they can be isolated and applied to a surface for the formation of a random array of the invention. FIG. 16 illustrates the creation of concatemers and disposition of these concatemers onto arrays, where they can subsequently be amplified using the methods of the invention to create arrays of the invention. Source nucleic acids (1600) are fragmented (1603) and the individual fragments (1606) are ligated (1605) to adaptors (1604) for circularization (1608), after which the population of circularized nucleic acids (1610) is formed (1612) into concatemers (1614) by RCR. The population of desired concatemers (1618) is then isolated (1616) and applied (1620) to a surface (1622) for creation of an array of first stage amplicons (1624).

Methods of Amplification

Any polynucleotide molecules of the invention, including polynucleotides, target polynucleotides, target sequences, and concatemers, can be amplified using methods known in the art and described herein. Such methods of amplification can generally be accomplished in solution or in situ (i.e., on a surface).

Suitable amplification methods include both target amplification and signal amplification and include, but are not limited to, polymerase chain reaction (PCR), ligation chain reaction (LCR; sometimes referred to as oligonucleotide ligase amplification, OLA), cycling probe technology (CPT), strand displacement amplification (SDA), transcription mediated amplification (TMA), nucleic acid sequence based amplification (NASBA), rolling circle amplification (RCA), and invasive cleavage technology. All of these methods require a primer nucleic acid (including nucleic acid analogs) that is hybridized to a target sequence to form a hybridization complex, and an enzyme that in some way modifies the primer to form a modified primer. For example, PCR generally requires two primers, dNTPs and a DNA polymerase; LCR requires two primers that adjacently hybridize to the target sequence and a ligase; CPT requires one cleavable primer and a cleaving enzyme; invasive cleavage requires two primers and a cleavage enzyme; etc. Thus, in general, a target nucleic acid is added to a reaction mixture that comprises the necessary amplification components, and a modified primer is formed. Methods of amplification and detecting the products of amplification are discussed at length in U.S. Patent Publication No. 2006/0275782, which is hereby incorporated in its entirety for all purposes.

The methods of amplification described in this and following sections are often preludes to sequencing reactions, and often sequencing reactions incorporate an amplification step, as is also described further herein.

Strand Displacement Amplification

Strand displacement amplification (SDA) is generally described in Walker et al., in Molecular Methods for Virus Detection, Academic Press, Inc., 1995, and U.S. Pat. Nos. 5,455,166 and 5,130,238, all of which are hereby incorporated by reference.

In general, SDA may be described as follows. A single stranded target nucleic acid, usually a DNA target sequence, is contacted with an SDA primer. An “SDA primer” generally has a length of 25-100 nucleotides, with SDA primers of approximately 35 nucleotides being preferred. An SDA primer is substantially complementary to a region at the 3′ end of the target sequence, and the primer has a sequence at its 5′ end (outside of the region that is complementary to the target) that is a recognition sequence for a restriction endonuclease, sometimes referred to herein as a “nicking enzyme” or a “nicking endonuclease”. The SDA primer then hybridizes with the target sequence. The SDA reaction mixture also contains a polymerase (an “SDA polymerase”) and a mixture of all four deoxynucleoside-triphosphates (also called deoxynucleotides or dNTPs, i.e. dATP, dTTP, dCTP and dGTP), at least one species of which is a substituted or modified dNTP; thus, the SDA primer is modified, i.e. extended, to form a modified primer, sometimes referred to herein as a “newly synthesized strand”. The substituted dNTP is modified such that it will inhibit cleavage in the strand containing the substituted dNTP but will not inhibit cleavage on the other strand. Examples of suitable substituted dNTPs include, but are not limited to, 2′-deoxyadenosine 5′-O-(1-thiotriphosphate), 5-methyldeoxycytidine 5′-triphosphate, 2′-deoxyuridine 5′-triphosphate, and 7-deaza-2′-deoxyguanosine 5′-triphosphate. In addition, the substitution of the dNTP may occur after incorporation into a newly synthesized strand; for example, a methylase may be used to add methyl groups to the synthesized strand. In addition, if all the nucleotides are substituted, the polymerase may have 5′-3′ exonuclease activity. However, if less than all the nucleotides are substituted, the polymerase preferably lacks 5′-3′ exonuclease activity.

As will be appreciated by those in the art, the recognition site/endonuclease pair can be any of a wide variety of known combinations. The endonuclease is chosen to cleave a strand either at the recognition site, or either 3′ or 5′ to it, without cleaving the complementary sequence, either because the enzyme only cleaves one strand or because of the incorporation of the substituted nucleotides. Suitable recognition site/endonuclease pairs are well known in the art; suitable endonucleases include, but are not limited to, HincII, HindIII, AvaI, Fnu4HI, Tth111I, NciI, BstXI, BamHI, etc. A chart depicting suitable enzymes, their corresponding recognition sites, and the modified dNTP to use is found in U.S. Pat. No. 5,455,166, hereby expressly incorporated by reference.

Once the strand is nicked, a polymerase (an “SDA polymerase”) is used to extend the newly nicked strand, 5′-3′, thereby creating another newly synthesized strand. The polymerase chosen should be able to initiate 5′-3′ polymerization at a nick site, should also displace the polymerized strand downstream from the nick, and should lack 5′-3′ exonuclease activity (this may be additionally accomplished by the addition of a blocking agent). Thus, suitable polymerases in SDA include, but are not limited to, the Klenow fragment of DNA polymerase I, SEQUENASE 1.0 and SEQUENASE 2.0 (U.S. Biochemical), T5 DNA polymerase and Phi29 DNA polymerase.

In one aspect the invention provides methods of making a complex of copies of a polynucleotide molecule. In this aspect, a polynucleotide is amplified into a concatemer using RCR, resulting in a single stranded concatemer. Multiple copies of a second primer are then bound to the concatemer to initiate another round of DNA synthesis using a strand-displacing polymerase, which results in a complex of copies comprising partially displaced strands. In this embodiment, the original polynucleotide is generally a circular molecule comprising one or more adaptors. In a further embodiment, the primers used to initiate DNA synthesis are complementary or identical to a sequence of the one or more adaptors.

Cycling Probe Technology

Cycling probe technology (CPT) is a nucleic acid detection system based on signal or probe amplification rather than target nucleic acid amplification, such as is done in polymerase chain reactions (PCR). Cycling probe technology relies on a molar excess of labeled probe which contains a scissile linkage of RNA. Upon hybridization of the probe to the target, the resulting hybrid contains a portion of RNA:DNA. This area of RNA:DNA duplex is recognized by RNAseH and the RNA is excised, resulting in cleavage of the probe. The probe now consists of two smaller sequences which may be released, thus leaving the target intact for repeated rounds of the reaction. The unreacted probe is removed and the label is then detected. CPT is generally described in U.S. Pat. Nos. 5,011,769, 5,403,711, 5,660,988, and 4,876,187, and PCT published applications WO 95/05480, WO 95/1416, and WO 95/00667, all of which are specifically incorporated herein by reference.

Branched DNA Signal Amplification

“Branched DNA” signal amplification relies on the synthesis of branched nucleic acids, containing a multiplicity of nucleic acid “arms” that function to increase the amount of label that can be put onto one probe. This technology is generally described in U.S. Pat. Nos. 5,681,702, 5,597,909, 5,545,730, 5,594,117, 5,591,584, 5,571,670, 5,580,731, 5,624,802, 5,635,352, 5,594,118, 5,359,100, 5,124,246 and 5,681,697, all of which are hereby incorporated by reference.

Dendrimers

Similarly, dendrimers of nucleic acids serve to vastly increase the amount of label that can be added to a single molecule, using a similar idea but different compositions. This technology is described in U.S. Pat. No. 5,175,270.

Polymerase Chain Reaction Amplification

In one embodiment, the amplification technique is PCR. The polymerase chain reaction (PCR) is widely used and described, and involves the use of primer extension combined with thermal cycling to amplify a target sequence; see U.S. Pat. No. 4,683,195 and PCR Essential Data, J. W. Wiley & Sons, Ed. C. R. Newton, 1995, both of which are incorporated by reference. In addition, there are a number of variations of PCR which also find use in the invention, including “quantitative competitive PCR” or “QC-PCR”, “arbitrarily primed PCR” or “AP-PCR”, “immuno-PCR”, “Alu-PCR”, “PCR single strand conformational polymorphism” or “PCR-SSCP”, “reverse transcriptase PCR” or “RT-PCR”, “biotin capture PCR”, “vectorette PCR”, “panhandle PCR”, “PCR select cDNA subtraction”, and “allele-specific PCR”, among others. In some embodiments, PCR is not preferred.

Nucleic Acid Sequence Based Amplification and Transcription Mediated Amplification

Nucleic acid sequence based amplification (NASBA) is generally described in U.S. Pat. No. 5,409,818 and “Profiting from Gene-based Diagnostics”, CTB International Publishing Inc., N.J., 1996, both of which are incorporated by reference. NASBA is very similar to both TMA and QBR. Transcription mediated amplification (TMA) is generally described in U.S. Pat. Nos. 5,399,491, 5,888,779, 5,705,365, 5,710,029, all of which are incorporated by reference. The main difference between NASBA and TMA is that NASBA utilizes the addition of RNAse H to effect RNA degradation, and TMA relies on inherent RNAse H activity of the reverse transcriptase.

In general, these techniques involve the use of three enzymes: reverse transcriptase, T7 RNA polymerase, and RNase H; and the final amplification product is single-stranded RNA with a polarity opposite that of the target. The amplified RNA product can be detected using methods known in the art, for example through the use of a target-specific capture probe bound to magnetic particles in conjunction with a ruthenium-labeled detector probe and an instrument (NucliSens Reader; bioMerieux) capable of measuring electrochemiluminescence (ECL). Alternatively, polynucleotides amplified by NASBA can specifically be detected in real time through the use of molecular beacon probes included in the amplification reaction. Molecular beacon probes possess a 5′ fluorescent dye and a 3′ quencher molecule (typically, 4-dimethylaminophenylazobenzoyl [DABCYL]) and are designed to form stem-loop structures that bring into close proximity the 5′ and 3′ ends of the probe, resulting in minimal fluorescence. In the presence of a complementary target sequence, the probe will hybridize to the target, separating the reporter dye from the quencher, resulting in a measurable increase in fluorescence. These techniques generally result in a single starting RNA template generating a single DNA duplex. This DNA duplex results in the creation of multiple RNA strands, which can then be used to initiate the reaction again, and amplification thus proceeds rapidly.

Single Base Extension (SBE)

In a preferred embodiment, single base extension (SBE; sometimes referred to as “minisequencing”) is used for amplification. It should also be noted that SBE finds use in sequencing and genotyping applications, as is described below. Briefly, SBE is a technique that utilizes an extension primer that hybridizes to the target nucleic acid. A polymerase (generally a DNA polymerase) is used to extend the 3′ end of the primer with a nucleotide analog labeled with a detection label as described herein. Based on the fidelity of the enzyme, a nucleotide is only incorporated into the extension primer if it is complementary to the adjacent base in the target strand. Generally, the nucleotide is derivatized such that no further extensions can occur, so only a single nucleotide is added. However, for amplification reactions, this may not be necessary. Once the labeled nucleotide is added, detection of the label proceeds as described herein. See generally Syvanen et al., Genomics 8:684-692 (1990); U.S. Pat. Nos. 5,846,710 and 5,888,819; Pastinen et al., Genome Res. 7(6):606-614 (1997); all of which are expressly incorporated herein by reference.
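For purposes of illustration only, the base-selection logic of SBE can be sketched as follows: the nucleotide added to the 3′ end of the extension primer is the complement of the target base immediately adjacent to the primer binding site. The Python sketch below assumes perfect complementarity between primer and target, a single binding site, and an error-free polymerase; the function name and the example sequences are hypothetical.

```python
# Illustrative sketch: which nucleotide is incorporated in single base extension.
# Sequences are written 5'->3'; all names here are hypothetical.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sbe_incorporated_base(target: str, primer: str) -> str:
    """Return the nucleotide added to the 3' end of `primer` annealed to `target`."""
    site = reverse_complement(primer)   # region of the target bound by the primer
    index = target.find(site)
    if index == -1:
        raise ValueError("primer does not hybridize to the target")
    if index == 0:
        raise ValueError("no target base adjacent to the primer 3' end")
    # The primer's 3' end pairs with the 5'-most base of the binding site, so
    # the next template base lies immediately 5' of the site on the target.
    return COMPLEMENT[target[index - 1]]


if __name__ == "__main__":
    # Two hypothetical alleles differing at the position adjacent to the primer.
    print(sbe_incorporated_base("ACGTTGCAAG", "CTTGC"))
    print(sbe_incorporated_base("ACGTCGCAAG", "CTTGC"))
```

In this sketch the two hypothetical alleles yield different incorporated nucleotides, which is the basis for using SBE in genotyping.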

Oligonucleotide Ligation Amplification (OLA)

In one embodiment, OLA is used to amplify polynucleotide molecules. OLA, which is referred to as the ligation chain reaction (LCR) when two-stranded substrates are used, involves the ligation of two smaller probes into a single long probe, using the target sequence as the template. In LCR, the ligated probe product becomes the predominant template as the reaction progresses. The method can be run in two different ways; in a first embodiment, only one strand of a target sequence is used as a template for ligation; alternatively, both strands may be used. See generally U.S. Pat. Nos. 5,185,243, 5,679,524 and 5,573,907; EP 0 320 308 B1; EP 0 336 731 B1; EP 0 439 182 B1; WO 90/01069; WO 89/12696; WO 97/31256; and WO 89/09835, and U.S. Ser. Nos. 60/078,102 and 60/073,011, all of which are incorporated by reference.

In a preferred embodiment, the single-stranded target sequence comprises a first target domain and a second target domain, which are adjacent and contiguous. A first OLA primer and a second OLA primer nucleic acids are added, that are substantially complementary to their respective target domain and thus will hybridize to the target domains. These target domains may be directly adjacent, i.e. contiguous, or separated by a number of nucleotides. If they are non-contiguous, nucleotides are added along with means to join nucleotides, such as a polymerase, that will add the nucleotides to one of the primers. The two OLA primers are then covalently attached, for example using a ligase enzyme such as is known in the art, to form a modified primer. This forms a first hybridization complex comprising the ligated probe and the target sequence. This hybridization complex is then denatured (disassociated), and the process is repeated to generate a pool of ligated probes.

In a preferred embodiment, OLA is done for two strands of a double-stranded target sequence. The target sequence is denatured, and two sets of probes are added: one set as outlined above for one strand of the target, and a separate set (i.e. third and fourth primer probe nucleic acids) for the other strand of the target. In a preferred embodiment, the first and third probes will hybridize, and the second and fourth probes will hybridize, such that amplification can occur. That is, when the first and second probes have been attached, the ligated probe can now be used as a template, in addition to the second target sequence, for the attachment of the third and fourth probes. Similarly, the ligated third and fourth probes will serve as a template for the attachment of the first and second probes, in addition to the first target strand. In this way, an exponential, rather than just a linear, amplification can occur.
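The difference between linear and exponential amplification can be illustrated with a simple counting model. The Python sketch below is an idealization under stated assumptions: probes are in unlimited excess, every available template directs one ligation per cycle with a fixed efficiency, and the two target strands are present in equal numbers. The function names and parameters are hypothetical.

```python
# Illustrative counting model: linear (one strand) versus exponential
# (two strand) ligation-based amplification. All names are hypothetical.


def ligated_probes_single_strand(targets: float, cycles: int,
                                 efficiency: float = 1.0) -> float:
    """Only the target strand serves as template, so product grows linearly."""
    return targets * efficiency * cycles


def ligated_probes_two_strands(targets: float, cycles: int,
                               efficiency: float = 1.0) -> float:
    """Ligated probes also serve as templates for the complementary probe set."""
    first_second = 0.0   # ligated first and second probes
    third_fourth = 0.0   # ligated third and fourth probes
    for _ in range(cycles):
        # Templates for each probe pair: its target strand plus the ligated
        # products of the other pair formed in earlier cycles.
        new_first_second = efficiency * (targets + third_fourth)
        new_third_fourth = efficiency * (targets + first_second)
        first_second += new_first_second
        third_fourth += new_third_fourth
    return first_second + third_fourth


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, ligated_probes_single_strand(1, n), ligated_probes_two_strands(1, n))
```

With an efficiency of 1.0, the two-strand model roughly doubles the amount of ligated product in each cycle, whereas the single-strand model adds a constant amount per cycle.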

Chemical Ligation Techniques

A variation of ligase chain reaction (LCR) utilizes a “chemical ligation” of sorts, as is generally outlined in U.S. Pat. Nos. 5,616,464 and 5,767,259, both of which are hereby incorporated by reference in their entirety. In this embodiment, similar to enzymatic ligation, a pair of primers is utilized, wherein the first primer is substantially complementary to a first domain of the target and the second primer is substantially complementary to an adjacent second domain of the target (although, as for enzymatic ligation, if a “gap” exists, a polymerase and dNTPs may be added to “fill in” the gap). Each primer has a portion that acts as a “side chain” that does not bind the target sequence and instead acts as one half of a stem structure that interacts non-covalently through hydrogen bonding, salt bridges, van der Waals forces, etc. Preferred embodiments utilize substantially complementary nucleic acids as the side chains. Thus, upon hybridization of the primers to the target sequence, the side chains of the primers are brought into spatial proximity, and, if the side chains comprise nucleic acids as well, these can form side chain hybridization complexes.

At least one of the side chains of the primers comprises an activatable cross-linking agent, generally covalently attached to the side chain, which, upon activation, results in a chemical cross-link or chemical ligation. The activatable group may comprise any moiety that will allow cross-linking of the side chains, and include groups activated chemically, photonically and thermally, with photoactivatable groups being preferred. In some embodiments a single activatable group on one of the side chains is enough to result in cross-linking via interaction to a functional group on the other side chain; in alternate embodiments, activatable groups are required on each side chain.

Once the hybridization complex is formed, and the cross-linking agent has been activated such that the primers have been covalently attached, the reaction is subjected to conditions to allow for the disassociation of the hybridization complex, thus freeing the target to serve as a template for the next ligation or cross-linking. In this way, signal amplification occurs, and can be detected as described further herein.

Invasive Cleavage Techniques

In one embodiment, invasive cleavage technology is used to amplify polynucleotide molecules. This technology is described in a number of patents and patent applications, including U.S. Pat. Nos. 5,846,717; 5,614,402; 5,719,028; 5,541,311; and 5,843,669, all of which are hereby incorporated by reference in their entirety. Invasive cleavage technology is based on structure-specific nucleases that cleave nucleic acids in a site-specific manner. Two probes are used: an “invader” probe and a “signaling” probe. Both probes adjacently hybridize to a target sequence with overlap. For mismatch discrimination, the invader technology relies on complementarity at the overlap position where cleavage occurs. The enzyme cleaves at the overlap, and releases the “tail” which may or may not be labeled. This “tail” can then be detected. As described herein, many label probes known in the art can be used in accordance with this aspect of the invention.

Disposition of Concatemers and Circularized DNA Molecules on a Surface

In a preferred aspect, polynucleotide molecules, including concatemers and circularized DNA molecules, are disposed on a surface to form a random array of single molecules. Polynucleotide molecules can be fixed to a surface by a variety of techniques, including covalent attachment and non-covalent attachment. In one embodiment a surface may include capture probes that form complexes, e.g., double stranded duplexes, with a component of a polynucleotide molecule, such as an adaptor oligonucleotide. In other embodiments, capture probes may comprise oligonucleotide clamps, or like structures, that form triplexes with adaptors, as described in Gryaznov et al, U.S. Pat. No. 5,473,060, which is hereby incorporated in its entirety.

In another embodiment, a surface may have reactive functionalities that react with complementary functionalities on the polynucleotide molecules to form a covalent linkage, e.g., by way of the same techniques used to attach cDNAs to microarrays, e.g., Smirnov et al (2004), Genes, Chromosomes & Cancer, 40: 72-77; Beaucage (2001), Current Medicinal Chemistry, 8: 1213-1244, which are incorporated herein by reference. Long DNA molecules, e.g., several hundred nucleotides or larger, may also be efficiently attached to hydrophobic surfaces, such as a clean glass surface that has a low concentration of various reactive functionalities, such as —OH groups. Attachment through covalent bonds formed between the polynucleotide molecules and reactive functionalities on the surface is also referred to herein as “chemical attachment”.

In still another embodiment, polynucleotide molecules can adsorb to a surface. In such an embodiment, the polynucleotide molecules are immobilized through non-specific interactions with the surface, or through non-covalent interactions such as hydrogen bonding, van der Waals forces, and the like.

Attachment may also include wash steps of varying stringencies to remove incompletely attached single molecules or other reagents present from earlier preparation steps whose presence is undesirable or that are nonspecifically bound to the surface.

Upon attachment to a surface, single stranded polynucleotides generally fill a flattened spheroidal volume that on average is bounded by a region which is approximately equivalent to the diameter of a concatemer in random coil configuration. How compact a single stranded polynucleotide is once disposed on a surface can be affected by a number of factors, including the attachment chemistry used, the density of linkages between the polynucleotide and the surface, the nature of the surface, and the like. Preserving the compact form of the macromolecular structure of polynucleotides (including concatemers, target polynucleotides, and target sequences) on a surface can increase the signal to noise ratio; for example, a compact concatemer can result in a more intense signal from probes (e.g., fluorescently labeled oligonucleotides) that are specifically directed to components of the concatemer.

One measure of the size of a random coil polymer, such as single stranded DNA, is a root mean square of the end-to-end distance, which is roughly a measure of the diameter of the randomly coiled structure. Such diameter, referred to herein as a “random coil diameter,” can be measured by light scatter, using instruments, such as a Zetasizer Nano System (Malvern Instruments, UK), or like instrument. Additional size measures of macromolecular structures of the invention include molecular weight, e.g., in Daltons, and total polymer length, which in the case of a branched polymer is the sum of the lengths of all its branches.
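As a rough, non-limiting illustration of how a root mean square end-to-end distance scales with polymer length, the ideal (freely jointed) chain model gives a value equal to the segment length multiplied by the square root of the number of segments. The Python sketch below assumes that model; the spacing per nucleotide and the segment (Kuhn) length are input parameters that depend on conditions such as ionic strength, and the numerical values shown are placeholders chosen only to exercise the function, not measured properties of single stranded DNA.

```python
# Illustrative estimate under the freely jointed chain model (an assumption):
# rms end-to-end distance = segment_length * sqrt(number_of_segments).
import math


def rms_end_to_end_nm(n_nucleotides: int, nt_spacing_nm: float,
                      kuhn_length_nm: float) -> float:
    """Return the root mean square end-to-end distance in nanometers."""
    if n_nucleotides <= 0 or nt_spacing_nm <= 0 or kuhn_length_nm <= 0:
        raise ValueError("all inputs must be positive")
    contour_length_nm = n_nucleotides * nt_spacing_nm
    n_segments = contour_length_nm / kuhn_length_nm
    return kuhn_length_nm * math.sqrt(n_segments)


if __name__ == "__main__":
    # Placeholder parameters, not measured values.
    print(rms_end_to_end_nm(100, 1.0, 4.0))
```

Under this model the distance grows with the square root of length, so a fourfold longer polymer is expected to have about twice the random coil diameter.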

In one aspect, as illustrated in FIG. 17, macromolecular structures, e.g., concatemers, and the like, are attached to a surface (1702) within a region that is substantially equivalent to a projection of its random coil state onto surface (1702), for example, as illustrated by dashed circles (1708). An area occupied by a macromolecular structure can vary, so that in some embodiments, an expected area may be within the range of from 2-3 times the area of projection (1708) to some fraction of such area, e.g., 25-50 percent. As discussed herein, preserving the compact form of the macromolecular structure on the surface allows a more intense signal to be produced by probes, such as fluorescently labeled oligonucleotides, which are specifically directed to components of a macromolecular structure or concatemer. The size of diameter (1710) of regions (1707) and distance (1706) to the nearest neighbor region containing a single molecule are two quantities of interest in the fabrication of arrays.

A variety of distance metrics may be employed for measuring the closeness of single molecules on a surface, including center-to-center distance of regions, edge-to-edge distance of regions, and the like. Usually, center-to-center distances are employed herein. The selection of these parameters in fabricating arrays of the invention depends in part on the signal generation and detection systems used in the analytical processes. Generally, densities of single molecules are selected that permit at least thirty percent, or at least fifty percent, or at least a majority of the molecules to be resolved individually by the signal generation and detection systems used. In one aspect, a density is selected that permits at least seventy percent of the single molecules to be individually resolved. In one embodiment, scanning electron microscopy is employed, for example, with molecule-specific probes having gold nanoparticle labels, e.g., Nie et al (2006), Anal. Chem., 78: 1528-1534, which is incorporated by reference. In such an embodiment a density is selected such that at least a majority of single molecules have a nearest neighbor distance of 50 nm or greater; and in another aspect, such density is selected to ensure that at least seventy percent of single molecules have a nearest neighbor distance of 100 nm or greater. In another embodiment, optical microscopy is employed, for example with molecule-specific probes having fluorescent labels, and in this embodiment a density is selected such that at least a majority of single molecules have a nearest neighbor distance of 200 nm or greater. In still another embodiment, a density is selected to ensure that at least seventy percent of single molecules have a nearest neighbor distance of 200 nm or greater.
In still another embodiment, optical microscopy is employed, for example with molecule-specific probes having fluorescent labels, and in this embodiment a density is selected such that at least a majority of single molecules have a nearest neighbor distance of 300 nm or greater; in a further embodiment, such density is selected to ensure that at least seventy percent of single molecules have a nearest neighbor distance of 300 nm or greater, or 400 nm or greater, or 500 nm or greater, or 600 nm or greater, or 700 nm or greater, or 800 nm or greater. In still another embodiment in which optical microscopy is used, a density is selected such that at least a majority of single molecules have a nearest neighbor distance of at least twice the minimal feature resolution power of the microscope. In another aspect, polymer molecules (including polynucleotides, concatemers, target polynucleotides, and other polynucleotide molecules discussed herein) of the invention are disposed on a surface so that the density of separately detectable polymer molecules is at least 1000 per μm², or at least 10,000 per μm², or at least 100,000 per μm².
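The relationship between surface density and nearest neighbor distance described above can be sketched numerically. The example below assumes, purely for illustration, that single molecules behave as points disposed uniformly at random on the surface (a two-dimensional Poisson process), in which case the fraction of molecules having no neighbor closer than a distance r is exp(-density × π × r²). The function names are hypothetical, and the model ignores the finite size of the molecules.

```python
import math


def fraction_with_clear_neighborhood(density_per_um2, min_distance_um):
    """Fraction of randomly placed points whose nearest neighbor lies at
    min_distance_um or farther, assuming a 2D Poisson point process."""
    return math.exp(-density_per_um2 * math.pi * min_distance_um ** 2)


def max_density_per_um2(target_fraction, min_distance_um):
    """Largest density (points per square micrometer) at which at least
    target_fraction of points keep a nearest neighbor distance of
    min_distance_um or greater, under the same Poisson assumption."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    if min_distance_um <= 0:
        raise ValueError("min_distance_um must be positive")
    return -math.log(target_fraction) / (math.pi * min_distance_um ** 2)


if __name__ == "__main__":
    # Seventy percent of molecules with a nearest neighbor at 300 nm or more.
    density = max_density_per_um2(0.70, 0.300)
    print(density, fraction_with_clear_neighborhood(density, 0.300))
```

Under this assumed model, the seventy percent criterion at 300 nm corresponds to a density of roughly 1.3 molecules per μm²; other placement statistics would give different figures.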

In one aspect, polynucleotide molecules on a surface are confined to an area of a discrete region. Discrete regions may be incorporated into a surface using methods known in the art and described further herein. In a preferred embodiment, discrete regions contain reactive functionalities or capture probes which can be used to immobilize the polynucleotide molecules.

The discrete regions may have defined locations in a regular array, which may correspond to a rectilinear pattern, hexagonal pattern, or the like. A regular array of such regions is advantageous for detection and data analysis of signals collected from the arrays during an analysis. Also, first- and/or second-stage amplicons confined to the restricted area of a discrete region provide a more concentrated or intense signal, particularly when fluorescent probes are used in analytical operations, thereby providing higher signal-to-noise values. Amplicons of target polynucleotides are randomly distributed on the discrete regions so that a given region is equally likely to receive any of the different single molecules. In other words, the resulting arrays are not spatially addressable immediately upon fabrication, but may be made so by carrying out an identification, sequencing and/or decoding operation. As such, the identities of the polynucleotide molecules of the invention disposed on a surface are discernable, but not initially known upon their disposition on the surface.

One embodiment in which discrete regions are used in the disposition of polynucleotide molecules on a surface is illustrated in FIG. 13. In this embodiment, the requirement of selecting densities of randomly disposed single molecules to ensure desired nearest neighbor distances is obviated by providing discrete regions on a surface, and these discrete regions are substantially the only sites for attaching single molecules to a surface. In a preferred embodiment molecules are directed to the discrete regions, because the areas between the discrete regions, referred to herein as “inter-regional areas,” are inert, in the sense that concatemers, or other macromolecular structures, do not bind to such regions. In some embodiments, such inter-regional areas may be treated with blocking agents, e.g., DNAs unrelated to concatemer DNA, other polymers, and the like.

One embodiment of the invention in which discrete regions are utilized is illustrated in FIG. 18. Isolated concatemers or amplicons (1814) are then applied to surface (1820) that has a regular array of discrete regions (1822) that each have a nearest neighbor distance (1824) that is determined by the design and fabrication of surface (1820). Arrays of discrete regions (1822) having micron and submicron dimensions for derivatizing with capture oligonucleotides or reactive functionalities can be fabricated using conventional semiconductor fabrication techniques, including electron beam lithography, nano imprint technology, photolithography, and the like. Generally, the area of discrete regions (1822) is selected, along with attachment chemistries, macromolecular structures employed, and the like, to correspond to the size of single molecules of the invention so that when single molecules are applied to surface (1820) substantially every region (1822) is occupied by no more than one single molecule.

The likelihood of having only one single molecule per discrete region may be increased by selecting a density of reactive functionalities or capture oligonucleotides that results in fewer such moieties than their respective complements on single molecules. Thus, a single molecule will “occupy” all linkages to the surface at a particular discrete region, thereby reducing the chance that a second single molecule will also bind to the same region. In particular, in one embodiment, substantially all the capture oligonucleotides in a discrete region hybridize to adaptor oligonucleotides in a single macromolecular structure. In a further embodiment, a discrete region contains a number of reactive functionalities or capture oligonucleotides that is from about ten percent to about fifty percent of the number of complementary functionalities or adaptor oligonucleotides of a single molecule.

The length and sequence(s) of capture oligonucleotides may vary widely, and may be selected in accordance with well known principles, e.g., Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Britten and Davidson, chapter 1 in Hames et al, editors, Nucleic Acid Hybridization: A Practical Approach (IRL Press, Oxford, 1985). In one embodiment, the lengths of capture oligonucleotides are in a range of from about 6 to about 50 nucleotides; in a further embodiment, the lengths of capture oligonucleotides are in a range of from about 8 to about 30 nucleotides; in a still further embodiment, the lengths are from about 10 to about 24 nucleotides. Lengths and sequences of capture oligonucleotides are selected (i) to provide effective binding of macromolecular structures to a surface, so that losses of macromolecular structures are minimized during steps of analytical operations, such as washing, etc., and (ii) to avoid interference with analytical operations on analyte molecules, particularly when analyte molecules are DNA fragments in a concatemer.

In regard to providing effective binding of macromolecular structures to a surface, in accordance with one aspect of the invention, sequences and lengths are selected to provide duplexes between capture oligonucleotides and their complements that are sufficiently stable so that they do not dissociate in a stringent wash.

In regard to avoiding interference with analytical molecules, if DNA fragments are from a particular species of organism, then databases, when available, may be used to screen potential capture sequences that may form spurious or undesired hybrids with DNA fragments.

Other factors in selecting sequences for capture oligonucleotides are similar to those considered in selecting primers, hybridization probes, oligonucleotide tags, and the like, for which there is ample guidance in the art.

In some embodiments, a discrete region may contain more than one kind of capture oligonucleotide, and each different capture oligonucleotide may have a different length and sequence.

In one aspect of the invention, regular arrays of discrete regions are employed, and sequences of capture oligonucleotides are selected so that the capture oligonucleotides at nearest neighbor regions have different sequences. In a rectilinear array, such configurations are achieved by establishing rows of alternating sequence types. In one embodiment, a surface may have a plurality of subarrays of discrete regions wherein each different subarray has capture oligonucleotides with distinct nucleotide sequences different from those of the other subarrays. A plurality of subarrays may include 2 subarrays, or 4 or fewer subarrays, or 8 or fewer subarrays, or 16 or fewer subarrays, or 32 or fewer subarrays, or 64 or fewer subarrays. In still another embodiment, a surface may include 5000 or fewer subarrays.

In one aspect, capture probes are attached to the surface of an array by a spacer molecule, e.g., polyethylene glycol, or like inert chain, as is done with microarrays, in order to minimize undesired effects of surface groups or interactions with the capture oligonucleotides or other reagents.

In another aspect, if enzymatic processing is not required, capture oligonucleotides may comprise non-natural nucleosidic units and/or linkages that confer favorable properties, such as increased duplex stability; such compounds include, but are not limited to, peptide nucleic acids (PNAs), locked nucleic acids (LNAs), oligonucleotide N3′→P5′ phosphoramidates, oligo-2′-O-alkylribonucleotides, and the like.

In one aspect, the area of discrete regions (1822) is less than 1 μm²; and in another aspect, the area of discrete regions (1822) is in the range of from 0.04 μm² to 1 μm²; and in still another aspect, the area of discrete regions (1822) is in the range of from 0.2 μm² to 1 μm². In another aspect, when discrete regions are approximately circular or square in shape so that their sizes can be indicated by a single linear dimension, the size of such regions is in the range of from 125 nm to 250 nm, or in the range of from 200 nm to 500 nm. In one aspect center-to-center distances of nearest neighbors of regions (1824) are in the range of from 0.25 μm to 20 μm; and in another aspect, such distances are in the range of from 1 μm to 10 μm, or in the range from 50 to 1000 nm. Generally, discrete regions are designed such that a majority of the discrete regions on a surface are optically resolvable. In one aspect, regions (1822) may be arranged on surface (1820) in virtually any pattern in which regions (1822) have defined locations, i.e. in any regular array, which makes signal collection and data analysis functions more efficient. Such patterns include, but are not limited to, concentric circles of regions, spiral patterns, rectilinear patterns, hexagonal patterns, and the like. Preferably, regions (1822) are arranged in a rectilinear or hexagonal pattern (1820).

Supports and Surfaces of the Invention

A wide variety of supports may be used with the compositions and methods of the invention to form random arrays. In one aspect, supports are rigid solids that have a surface, preferably a substantially planar surface so that single molecules to be interrogated are in the same plane. The latter feature permits efficient signal collection by detection optics, for example. In another aspect, the support comprises beads, wherein the surface of the beads comprises reactive functionalities or capture probes that can be used to immobilize polynucleotide molecules.

In still another aspect, solid supports of the invention are nonporous, particularly when random arrays of single molecules are analyzed by hybridization reactions requiring small volumes. Suitable solid support materials include materials such as glass, polyacrylamide-coated glass, ceramics, silica, silicon, quartz, various plastics, and the like. In one aspect, the area of a planar surface may be in the range of from 0.5 to 4 cm². In one aspect, the solid support is glass or quartz, such as a microscope slide, having a surface that is uniformly silanized. This may be accomplished using conventional protocols, e.g., acid treatment followed by immersion in a solution of 3-glycidoxypropyl trimethoxysilane, N,N-diisopropylethylamine, and anhydrous xylene (8:1:24 v/v) at 80° C., which forms an epoxysilanized surface, e.g., Beattie et al (1995), Molecular Biotechnology, 4: 213. Such a surface is readily treated to permit end-attachment of capture oligonucleotides, e.g., by providing capture oligonucleotides with a 3′ or 5′ triethylene glycol phosphoryl spacer (see Beattie et al, cited above) prior to application to the surface. Further embodiments for functionalizing and further preparing surfaces for use in the present invention are described in U.S. patent application Ser. No. 11/451,691.

In embodiments of the invention in which patterns of discrete regions are required, photolithography, electron beam lithography, nano imprint lithography, and nano printing may be used to generate such patterns on a wide variety of surfaces, e.g., Pirrung et al, U.S. Pat. No. 5,143,854; Fodor et al, U.S. Pat. No. 5,774,305; Guo, (2004) Journal of Physics D: Applied Physics, 37: R123-141; which are incorporated herein by reference.

In one aspect, surfaces containing a plurality of discrete regions are fabricated by photolithography. A commercially available, optically flat, quartz substrate is spin coated with a 100-500 nm thick layer of photo-resist. The photo-resist is then baked on to the quartz substrate. An image of a reticle with a pattern of regions to be activated is projected onto the surface of the photo-resist, using a stepper. After exposure, the photo-resist is developed, removing the areas of the projected pattern which were exposed to the UV source. This is accomplished by plasma etching, a dry developing technique capable of producing very fine detail. The substrate is then baked to strengthen the remaining photo-resist. After baking, the quartz wafer is ready for functionalization. The wafer is then subjected to vapor-deposition of 3-aminopropyldimethylethoxysilane. The density of the amino functionalized monomer can be tightly controlled by varying the concentration of the monomer and the time of exposure of the substrate. Only areas of quartz exposed by the plasma etching process may react with and capture the monomer. The substrate is then baked again to cure the monolayer of amino-functionalized monomer to the exposed quartz. After baking, the remaining photo-resist may be removed using acetone. Because of the difference in attachment chemistry between the resist and silane, aminosilane-functionalized areas on the substrate may remain intact through the acetone rinse. These areas can be further functionalized by reacting them with p-phenylenediisothiocyanate in a solution of pyridine and N,N-dimethylformamide. The substrate is then capable of reacting with amine-modified oligonucleotides. Alternatively, oligonucleotides can be prepared with a 5′-carboxy-modifier-c10 linker (Glen Research). This technique allows the oligonucleotide to be attached directly to the amine modified support, thereby avoiding additional functionalization steps.

In another aspect, surfaces containing a plurality of discrete regions are fabricated by nano-imprint lithography (NIL). For DNA array production, a quartz substrate is spin coated with a layer of resist, commonly called the transfer layer. A second type of resist is then applied over the transfer layer, commonly called the imprint layer. The master imprint tool then makes an impression on the imprint layer. The overall thickness of the imprint layer is then reduced by plasma etching until the low areas of the imprint reach the transfer layer. Because the transfer layer is harder to remove than the imprint layer, it remains largely untouched. The imprint and transfer layers are then hardened by heating. The substrate is then put into a plasma etcher until the low areas of the imprint reach the quartz. The substrate is then derivatized by vapor deposition as described above.

In another aspect, surfaces containing a plurality of discrete regions are fabricated by nano printing. This process uses photo, imprint, or e-beam lithography to create a master mold, which is a negative image of the features required on the print head. Print heads are usually made of a soft, flexible polymer such as polydimethylsiloxane (PDMS). This material, or layers of materials having different properties, is spin coated onto a quartz substrate. The mold is then used to emboss the features onto the top layer of resist material under controlled temperature and pressure conditions. The print head is then subjected to a plasma based etching process to improve the aspect ratio of the print head, and eliminate distortion of the print head due to relaxation over time of the embossed material. Random array substrates are manufactured using nano-printing by depositing a pattern of amine modified oligonucleotides onto a homogeneously derivatized surface. These oligonucleotides would serve as capture probes for the RCR products. One potential advantage to nano-printing is the ability to print interleaved patterns of different capture probes onto the random array support. This would be accomplished by successive printing with multiple print heads, each head having a differing pattern, and all patterns fitting together to form the final structured support pattern. Such methods allow for some positional encoding of DNA elements within the random array. For example, control concatemers containing a specific sequence can be bound at regular intervals throughout a random array.

In still another aspect, a high density array of capture oligonucleotide spots of sub micron size is prepared using a printing head or imprint-master prepared from a bundle, or bundle of bundles, of about 10,000 to 100 million optical fibers with a core and cladding material. By pulling and fusing fibers a unique material is produced that has about 50-1000 nm cores separated by a similar or 2-5 fold smaller or larger size cladding material. By differential etching (dissolving) of cladding material a nano-printing head is obtained having a very large number of nano-sized posts. This printing head may be used for depositing oligonucleotides or other biological (proteins, oligopeptides, DNA, aptamers) or chemical compounds such as silane with various active groups. In one embodiment the glass fiber tool is used as a patterned support to deposit oligonucleotides or other biological or chemical compounds. In this case only posts created by etching may be contacted with material to be deposited. Also, a flat cut of the fused fiber bundle may be used to guide light through cores and allow light-induced chemistry to occur only at the tip surface of the cores, thus eliminating the need for etching. In both cases, the same support may then be used as a light guiding/collection device for imaging fluorescence labels used to tag oligonucleotides or other reactants. This device provides a large field of view with a large numerical aperture (potentially >1). Stamping or printing tools that perform active material or oligonucleotide deposition may be used to print 2 to 100 different oligonucleotides in an interleaved pattern. This process requires precise positioning of the print head to about 50-500 nm. This type of oligonucleotide array may be used for attaching 2 to 100 different DNA populations such as different source DNA. They also may be used for parallel reading from sub-light resolution spots by using DNA specific anchors or tags. 
Information can be accessed by DNA specific tags, e.g., 16 specific anchors for 16 DNAs and read 2 bases by a combination of 5-6 colors and using 16 ligation cycles or one ligation cycle and 16 decoding cycles. This way of making arrays is efficient if limited information (e.g., a small number of cycles) is required per fragment, thus providing more information per cycle or more cycles per surface.

In one embodiment “inert” concatemers are used to prepare a surface for attachment of test concatemers. The surface is first covered by capture oligonucleotides complementary to the binding site present on two types of synthetic concatemers; one is a capture concatemer, the other is a spacer concatemer. The spacer concatemers do not have DNA segments complementary to the adapter used in preparation of test concatemers and they are used in about 5- to 50-fold, preferably 10-fold, excess to capture concatemers. The surface with capture oligonucleotide is “saturated” with a mix of synthetic concatemers (prepared by chain ligation or by RCR) in which the spacer concatemers are used in about 10-fold (or 5 to 50-fold) excess to capture concatemers. Because of the 10:1 ratio between spacer and capture concatemers, the capture concatemers are mostly individual islands in a sea of spacer concatemers. The 10:1 ratio provides that two capture concatemers are on average separated by two spacer concatemers. If concatemers are about 200 nm in diameter, then two capture concatemers are at about 600 nm center-to-center spacing. This surface is then used to attach test concatemers or other molecular structures that have a binding site complementary to a region of the capture concatemers but not present on the spacer concatemers. Capture concatemers may be prepared to have fewer copies than the number of binding sites in test concatemers to assure single test concatemer attachment per capture concatemer spot. Because the test DNA can bind only to capture concatemers, an array of test concatemers may be prepared that have high site occupancy without congregation. Due to random attachment, some areas on the surface may not have any concatemers attached, but these areas with free capture oligonucleotide may not be able to bind test concatemers since they are designed not to have binding sites for the capture oligonucleotide.
An array of individual test concatemers as described would not be arranged in a grid pattern. An ordered grid pattern should simplify data collection because fewer pixels and less sophisticated image analysis systems are needed.
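The spacing arithmetic for capture concatemers in a sea of spacer concatemers can be sketched as follows. This is a minimal illustration that assumes the surface is saturated with close-packed concatemers, each occupying a square site one diameter wide, so that the mean center-to-center spacing of capture concatemers scales as the diameter times the square root of the number of sites per capture concatemer. The packing model and the function name are assumptions made for illustration and are not part of the described method.

```python
import math


def capture_spacing_nm(diameter_nm, spacer_to_capture_ratio):
    """Approximate mean center-to-center spacing of capture concatemers.

    Assumes a saturated surface on which every concatemer occupies a square
    site of side diameter_nm, so one capture concatemer is found per
    (ratio + 1) sites on average.
    """
    if diameter_nm <= 0 or spacer_to_capture_ratio < 0:
        raise ValueError("diameter must be positive and ratio non-negative")
    sites_per_capture = spacer_to_capture_ratio + 1
    return diameter_nm * math.sqrt(sites_per_capture)


if __name__ == "__main__":
    # 200 nm concatemers at a 10:1 spacer-to-capture ratio.
    print(round(capture_spacing_nm(200, 10)))
```

With 200 nm concatemers and a 10:1 ratio, this model gives a spacing of about 660 nm, which is of the same order as the approximately 600 nm (two intervening spacer concatemers) stated above.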

In one aspect, multiple arrays of the invention may be placed on a single surface. For example, patterned array substrates may be produced to match the standard 96 or 384 well plate format. A production format can be an 8×12 pattern of 6 mm×6 mm arrays at 9 mm pitch or 16×24 of 3.33 mm×3.33 mm array at 4.5 mm pitch, on a single piece of glass or plastic and other optically compatible material. In one example each 6 mm×6 mm array consists of 36 million 250-500 nm square regions at 1 micrometer pitch. Hydrophobic or other surface or physical barriers may be used to prevent mixing different reactions between unit arrays.
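The figures given for the 96-array format can be checked with a short calculation. The sketch below counts the discrete regions in a square unit array from its side length and region pitch, and then totals them over the substrate; the function names are hypothetical and a simple square grid of regions is assumed.

```python
def regions_per_array(side_mm, pitch_um):
    """Number of discrete regions in a square unit array laid out on a
    square grid, given the array side (mm) and region pitch (micrometers)."""
    if side_mm <= 0 or pitch_um <= 0:
        raise ValueError("side_mm and pitch_um must be positive")
    regions_per_side = int(round(side_mm * 1000 / pitch_um))
    return regions_per_side * regions_per_side


def regions_per_substrate(rows, cols, side_mm, pitch_um):
    """Total discrete regions over a rows x cols layout of unit arrays."""
    return rows * cols * regions_per_array(side_mm, pitch_um)


if __name__ == "__main__":
    # 6 mm x 6 mm unit arrays at 1 micrometer pitch in an 8 x 12 layout.
    print(regions_per_array(6, 1.0))
    print(regions_per_substrate(8, 12, 6, 1.0))
```

A 6 mm×6 mm array at 1 micrometer pitch holds 6000 × 6000 = 36 million regions, consistent with the example above, and an 8×12 layout of such arrays holds about 3.5 billion regions in total.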

In a preferred aspect, sites on a surface in which polynucleotide molecules of the invention are disposed are surrounded by inter-regional areas which are inert. In such an aspect, non-specific binding in the inter-regional areas is minimized by controlling the physical and chemical features of these inter-regional areas. Methods for establishing such inert inter-regional areas are well known in the art. For example, the inter-regional areas may be prepared with hexamethyldisilazane (HMDS), or a similar agent covalently bonded to the surface, to be hydrophobic and hence unsuitable to hydrophilic bonding of the DNA samples. Similarly, the inter-regional areas may be coated with a chemical agent such as a fluorine-based carbon compound that renders the areas unreactive to DNA samples.

In another aspect of the invention, random arrays are prepared using nanometer-sized beads. Sub-micron glass or other types of beads (e.g., in the 20-50 nm range) are used which are derivatized with a short oligonucleotide, e.g., 6-30 nucleotides, complementary to an adaptor oligonucleotide in the circles used to generate concatemers. The number of oligonucleotides on the bead and the length of the sequence can be controlled to weakly bind the concatemers in solution. In one embodiment, the density of capture probes can be controlled through the use of shorter oligonucleotides that have the same attachment chemistry with the capture probe. Also, much smaller nano-beads (20-50 nm) can be used in accordance with this aspect of the invention. After binding concatemers, the beads can be allowed to settle on the surface of an array substrate. Array conditions may be selected to permit preferential binding to the surface, thereby forming a spaced array of concatemers. If the beads are magnetic, a magnetic field can be used to pull the beads to the surface and may also be used to move them around the surface. Alternatively, a centrifuge may be used to concentrate the beads on the surface. In still another embodiment, horizontal or tilting movements of the surface can be used to move beads from the inter-regional areas to settle in discrete regions manufactured into the surface as described herein.

Methods of Identifying Nucleotide Sequence

In a preferred aspect, random arrays of the invention are used to identify a nucleotide sequence of one or more target polynucleotides. As discussed herein target polynucleotides may be in the form of concatemers, may be linear or circular, and will generally contain one or more target sequences, where the target sequences in a preferred embodiment comprise one or more fragments of the target polynucleotide and are generally shorter in length than the target polynucleotide.

Target sequences can in turn comprise different target domains; for example, a first target domain of the sample target sequence may hybridize to a capture probe and a second target domain may hybridize to a label probe, etc. The target domains may be adjacent to each other or separated (such as by an adaptor) as indicated. Unless specified, the terms “first” and “second” are not meant to confer an orientation of the sequences with respect to the 5′-3′ orientation of the target sequence. For example, assuming a 5′-3′ orientation of the complementary target sequence, the first target domain may be located either 5′ to the second domain, or 3′ to the second domain.

Techniques for identifying polynucleotide sequences fall into five general categories: (1) techniques that rely on traditional hybridization methods that utilize the variation of stringency conditions (temperature, buffer conditions, etc.) to distinguish nucleotides at the detection position; (2) extension techniques that add a base (“the base”) to basepair with the nucleotide at the detection position; (3) ligation techniques, that rely on the specificity of ligase enzymes (or, in some cases, on the specificity of chemical techniques), such that ligation reactions occur preferentially if perfect complementarity exists at the detection position; (4) cleavage techniques, that also rely on enzymatic or chemical specificity such that cleavage occurs preferentially if perfect complementarity exists; and (5) techniques that combine these methods. Each of these techniques may be used in a solution based assay, wherein the reaction is done in solution and a reaction product is bound to the array for subsequent detection, or in solid phase assays, where the reaction occurs on the surface and is detected.

Sequencing by hybridization has been described (Drmanac et al., Genomics 4:114 (1989); Koster et al., Nature Biotechnology 14:1123 (1996); U.S. Pat. Nos. 5,525,464; 5,202,231 and 5,695,940, 6,864,052; 6,309,824; 6,401,267 and U.S. Patent Pub. No. 2005/0191656, among others).

Sequencing by synthesis is an alternative to gel-based sequencing. These methods add and read only one base (or at most a few bases, typically of the same type) prior to polymerization of the next base. This can be referred to as “time resolved” sequencing, to contrast from “gel-resolved” sequencing. Sequencing by synthesis has been described in U.S. Pat. Nos. 4,971,903; 6,828,100; 6,833,256; 6,911,345, as well as in Hyman, Anal. Biochem. 174:423 (1988); Rosenthal, International Patent Application Publication 761107 (1989); Metzker et al., Nucl. Acids Res. 22:4259 (1994); Jones, Biotechniques 22:938 (1997); Ronaghi et al., Anal. Biochem. 242:84 (1996); Ronaghi et al (1998), Science, 281: 363-365; Nyren et al., Anal. Biochem. 151:504 (1985); and Li et al, Proc. Natl. Acad. Sci., 100: 414-419 (2003). One promising sequencing by synthesis method is based on the detection of the pyrophosphate (PPi) released during the DNA polymerase reaction. As nucleoside triphosphates are added to a growing nucleic acid chain, they release PPi. This release can be quantitatively measured by the conversion of PPi to ATP by the enzyme sulfurylase, and the subsequent production of visible light by firefly luciferase.

Detection of ATP sulfurylase activity is described in Karamohamed and Nyren, Anal. Biochem. 271:81 (1999). Sequencing using reversible chain terminating nucleotides is described in U.S. Pat. Nos. 5,902,723 and 5,547,839, and Canard and Arzumanov, Gene 11:1 (1994), and Dyatkina and Arzumanov, Nucleic Acids Symp Ser 18:117 (1987). Reversible chain termination with DNA ligase is described in U.S. Pat. No. 5,403,708. Time resolved sequencing is described in Johnson et al., Anal. Biochem. 136:192 (1984). Single molecule analysis is described in U.S. Pat. No. 5,795,782 and Eigen and Rigler, Proc. Natl. Acad Sci USA 91(13):5740 (1994), all of which are hereby expressly incorporated by reference in their entirety. Several assay systems have been described that capitalize on this mechanism. See for example WO93/23564, WO 98/28440 and WO98/13523, all of which are expressly incorporated by reference. A preferred method is described in Ronaghi et al., Science 281:363 (1998). In this method, the four deoxynucleotides (dATP, dGTP, dCTP and dTTP; collectively dNTPs) are added stepwise to a partial duplex comprising a sequencing primer hybridized to a single stranded DNA template and incubated with DNA polymerase, ATP sulfurylase, luciferase, and optionally a nucleotide-degrading enzyme such as apyrase. A dNTP is only incorporated into the growing DNA strand if complementary to the base in the template strand. The synthesis of DNA is accompanied by the release of PPi equal in molarity to the incorporated dNTP. The PPi is converted to ATP and the light generated by the luciferase is directly proportional to the amount of ATP. In some cases the unincorporated dNTPs and the produced ATP are degraded between each cycle by the nucleotide degrading enzyme.
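The stepwise addition of dNTPs and the proportionality between light output and incorporated nucleotides can be illustrated with an idealized simulation. In the sketch below, each dispensed nucleotide is incorporated as many times as it matches consecutive positions of the strand being synthesized, and the recorded signal is simply that count. The function names, the cyclic dispensation order, and the noise-free signal are assumptions made for illustration; real signals are not exact integers, particularly for long runs of the same base.

```python
def simulate_flowgram(synthesized_strand, flow_order="ACGT", max_flows=400):
    """Return a list of (base, signal) pairs for an idealized reaction.

    synthesized_strand is the strand built on the primer (the complement of
    the template). The signal for each flow is the number of consecutive
    bases incorporated, standing in for light proportional to released PPi.
    """
    signals = []
    position = 0
    flow = 0
    while position < len(synthesized_strand) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        incorporated = 0
        # A dispensed base keeps extending the strand while it matches.
        while (position < len(synthesized_strand)
               and synthesized_strand[position] == base):
            incorporated += 1
            position += 1
        signals.append((base, incorporated))
        flow += 1
    return signals


def decode_flowgram(signals):
    """Rebuild the synthesized sequence from (base, signal) pairs."""
    return "".join(base * count for base, count in signals)


if __name__ == "__main__":
    flows = simulate_flowgram("GATTC")
    print(flows)
    print(decode_flowgram(flows))
```

In this idealized model, a flow that incorporates two identical bases in a row yields twice the signal of a single incorporation, and the original sequence is recovered by reading the signals in flow order.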

Ligation-based methods of sequencing are also known in the art, see e.g., Shendure et al (2005), Science, 309: 1728-1732.

The oligonucleotide ligation assay (OLA; sometimes referred to as the ligation chain reaction (LCR)) involves the ligation of at least two smaller probes into a single long probe, using the target sequence as the template for the ligase. See generally U.S. Pat. Nos. 5,185,243, 5,679,524 and 5,573,907; EP 0 320 308 B1; EP 0 336 731 B1; EP 0 439 182 B1; WO 90/01069; WO 89/12696; and WO 89/09835, all of which are incorporated by reference.

Sequencing using mass spectrometry techniques has also been described; see Koster et al., Nature Biotechnology 14:1123 (1996).

Many of the above described methods require a primer nucleic acid (including nucleic acid analogs) that is hybridized to a target sequence to form a hybridization complex, and an enzyme is added that in some way modifies the primer to form a modified primer. For example, PCR generally requires two primers, dNTPs and a DNA polymerase; LCR requires two primers that adjacently hybridize to the target sequence and a ligase; CPT requires one cleavable primer and a cleaving enzyme; invasive cleavage requires two primers and a cleavage enzyme; etc. Thus, in general, a target nucleic acid is added to a reaction mixture that comprises the necessary amplification components, and a modified primer is formed. In general, the modified primer comprises a detectable label, such as a fluorescent label, which is either incorporated by the enzyme or present on the original primer. As required, the unreacted primers are removed, in a variety of ways, as will be appreciated by those in the art and outlined herein. The modified primer can be detected and/or quantified using methods known in the art, and its presence can be used to identify and quantify the associated target sequence(s). In some cases, the newly modified primer serves as a target sequence for a secondary reaction, which then produces a number of amplified strands, which can also be detected as described herein.

In a preferred aspect, sequencing techniques known in the art and described herein are employed on concatemers comprising target sequences. As discussed herein, target sequences can be prepared using known techniques. Once prepared, the target sequence can be used in a variety of reactions for a variety of reasons. For example, in a specific aspect of the invention, genotyping reactions are done. Similarly, these reactions can also be used to detect the presence or absence of a target sequence. In addition, in any reaction, quantitation of the amount of a target sequence may be done. While the discussion below focuses on genotyping reactions, the discussion applies equally to detecting the presence of target sequences and/or their quantification.

In a preferred aspect of specific embodiments, a target sequence comprises a position for which sequence information is desired, generally referred to herein as the “detection position” or “detection locus”. In a particularly preferred aspect of specific embodiments, the detection position is a single nucleotide, although in some aspects, it may comprise a plurality of nucleotides, either contiguous with each other or separated by one or more nucleotides. By “plurality” as used herein is meant at least two. As used herein, the base which basepairs with a detection position base in a hybrid is termed a “readout position” or an “interrogation position”. “Readout” means a parameter, or parameters, which are measured and/or detected that can be converted to a number or value. In some contexts, readout may refer to an actual numerical representation of such collected or recorded data. For example, a readout of fluorescent intensity signals from a microarray is the position and fluorescence intensity of a signal being generated at each hybridization site of the microarray; thus, such a readout may be registered or stored in various ways, for example, as an image of the microarray, as a table of numbers, or the like.

In some aspects, as is discussed herein, the target sequence may not be the sample target sequence but instead is a product of a reaction herein, sometimes referred to herein as a “secondary” or “derivative” target sequence. Thus, for example, in a single base extension (SBE) method, the extended primer may serve as the target sequence; similarly, in invasive cleavage variations, the cleaved detection sequence may serve as the target sequence.

In one aspect, a method of determining a nucleotide sequence of a target polynucleotide in accordance with the invention comprises the following steps: (a) generating a plurality of target concatemers from the target polynucleotide, each target concatemer comprising multiple copies of a fragment of the target polynucleotide and the plurality of target concatemers including a number of fragments that substantially covers the target polynucleotide; (b) forming a random array of target concatemers fixed to a surface at a density such that at least a majority of the target concatemers are optically resolvable; (c) identifying a sequence of at least a portion of each fragment in each target concatemer; and (d) reconstructing the nucleotide sequence of the target polynucleotide from the identities of the sequences of the portions of fragments of the concatemers.

As used herein, “substantially covers” means that the amount of nucleotides (i.e., target sequences) analyzed contains an equivalent of at least two copies of the target polynucleotide, or in another aspect, at least ten copies, or in another aspect, at least twenty copies, or in another aspect, at least 100 copies. Target polynucleotides may include DNA fragments, including genomic DNA fragments and cDNA fragments, and RNA fragments. Guidance for the step of reconstructing target polynucleotide sequences can be found in the following references, which are incorporated by reference: Lander et al, Genomics, 2: 231-239 (1988); Vingron et al, J. Mol. Biol., 235: 1-12 (1994); and like references.

In one aspect, a sequencing method for use with the invention for determining sequences in a plurality of DNA or RNA fragments comprises the following steps: (a) generating a plurality of polynucleotide molecules each comprising a concatemer of a DNA or RNA fragment; (b) forming a random array of polynucleotide molecules fixed to a surface at a density such that at least a majority of the target concatemers are optically resolvable; and (c) identifying a sequence of at least a portion of each DNA or RNA fragment in resolvable polynucleotides using at least one chemical reaction of an optically detectable reactant.

In a further aspect of specific embodiments, the optically detectable reactant used in identifying the sequence is an oligonucleotide. In another aspect, the optically detectable reactant is a nucleoside triphosphate, e.g., a fluorescently labeled nucleoside triphosphate that may be used to extend an oligonucleotide hybridized to a concatemer. In another aspect, the optically detectable reactant is an oligonucleotide formed by ligating first and second oligonucleotides that form adjacent duplexes on a concatemer. In another aspect, the chemical reaction of an optically detectable reactant is synthesis of DNA or RNA, e.g., by extending a primer hybridized to a concatemer. In yet another aspect, the optically detectable reactant is a nucleic acid binding oligopeptide or polypeptide or protein.

In one aspect, parallel sequencing of polynucleotide analytes of concatemers on a random array is accomplished by combinatorial SBH (cSBH). In a preferred aspect, first and second sets of oligonucleotide probes (also referred to herein as “label probes”) are provided, wherein each set has member probes that comprise oligonucleotides having every possible sequence for the defined length of probes in the set. For example, if a set contains probes of length six, then it contains 4096 (=4⁶) probes. In another aspect, first and second sets of oligonucleotide probes comprise probes having selected nucleotide sequences designed to detect selected sets of target polynucleotides. Sequences are determined by hybridizing one probe or pool of probes, hybridizing a second probe or a second pool of probes, ligating probes that form perfectly matched duplexes on their target sequences, identifying those probes that are ligated to obtain sequence information about the target sequence, repeating the steps until all the probes or pools of probes have been hybridized, and determining the nucleotide sequence of the target from the sequence information accumulated during the hybridization and identification steps.
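
The size of a complete probe set and the identity of the probe pairs expected to ligate on a given target can be sketched in Python as follows. For simplicity, each probe is represented here by the target-strand sequence it matches rather than by its complement, and hybridization is treated as all-or-nothing; both simplifications, and the function names, are assumptions made for illustration only.

```python
from itertools import product


def all_probes(length):
    """Enumerate every possible probe sequence of the given length."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]


def ligatable_pairs(target, first_len, second_len):
    """Return the (first probe, second probe) pairs that would form adjacent,
    perfectly matched duplexes somewhere on `target`."""
    span = first_len + second_len
    pairs = set()
    for start in range(len(target) - span + 1):
        first = target[start:start + first_len]
        second = target[start + first_len:start + span]
        pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    print(len(all_probes(6)))  # number of probes in a complete 6-mer set
    print(sorted(ligatable_pairs("ACGTAC", 2, 2)))
```

Each ligated pair reports a contiguous stretch of the target whose length equals the combined length of the two probes, which is the sequence information accumulated over the hybridization and identification steps.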

In one aspect of specific embodiments, the sets may be divided into subsets that are used together in pools, as disclosed in U.S. Pat. No. 6,864,052. Probes from the first and second sets may be hybridized to target sequences either together or in sequence, either as entire sets or as subsets, or pools. In one aspect, lengths of the probes in the first or second sets are in the range of from 5 to 10 nucleotides, and in another aspect, in the range of from 5 to 7 nucleotides, so that when ligated they form ligation products with a length in the range of from 10 to 20, and from 10 to 14, respectively.

In another aspect, the sequence identity of each attached DNA concatemer may be determined by a “signature” approach. About 50 to 100 or possibly 200 probes are used such that about 25-50% or in some applications 10-30% of attached concatemers will have a full match sequence for each probe. This type of data allows each amplified DNA fragment within a concatemer to be mapped to the reference sequence. For example, by such a process one can score 64 4-mers (i.e., 25% of all possible 256 4-mers) using 16 hybridization/strip-off cycles in a 4-color labeling scheme. On a 60-70 base fragment amplified in a concatemer about 16 of 64 probes will be positive since there are 64 possible 4-mers present in a 64 base long sequence (i.e., one quarter of all possible 4-mers). Unrelated 60-70 base fragments will have a very different set of about 16 positive decoding probes. A combination of 16 probes out of 64 probes has a random chance of occurrence in 1 of every one billion fragments, which practically provides a unique signature for that concatemer. Scoring 80 probes in 20 cycles and generating 20 positive probes creates a signature even more likely to be unique: occurrence by chance is 1 in a billion billions. Previously, a “signature” approach was used to select novel genes from cDNA libraries. An implementation of a signature approach is to sort the obtained intensities of all tested probes and select up to a predefined (expected) number of probes that satisfy the positive probe threshold. These probes will be mapped to sequences of all DNA fragments (a sliding window of a longer reference sequence may be used) expected to be present in the array. The sequence that has all or a statistically sufficient number of the selected positive probes is assigned as the sequence of the DNA fragment in the given concatemer. In another approach an expected signal can be defined for all used probes using their pre-measured full match and mismatch hybridization/ligation efficiency. In this case a measure similar to the correlation factor can be calculated.
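
The sliding-window mapping of a signature to a reference sequence can be sketched in Python as follows. The sketch assumes that the positive probes have already been selected by thresholding, and it scores each window with a Jaccard similarity between the observed and expected sets of positive probes. The Jaccard score is only a simple stand-in for a “statistically sufficient” match and is an assumption of this example, as are the function names.

```python
def positive_probes(fragment, probes, k=4):
    """Probes from the scored set that have a full match in `fragment`."""
    present = {fragment[i:i + k] for i in range(len(fragment) - k + 1)}
    return present & set(probes)


def best_window(observed, reference, probes, window, k=4):
    """Slide a window along `reference` and return (start, score) of the
    window whose expected positive probes best match `observed`."""
    observed = set(observed)
    best_start, best_score = None, -1.0
    for start in range(len(reference) - window + 1):
        expected = positive_probes(reference[start:start + window], probes, k)
        union = observed | expected
        # Jaccard similarity: shared positives over all positives.
        score = len(observed & expected) / len(union) if union else 0.0
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


if __name__ == "__main__":
    reference = "AAAACCCCGGGGTTTT"
    probes = ["AAAA", "CCCC", "GGGG", "TTTT", "AACC", "GGTT"]
    fragment = reference[4:12]
    signature = positive_probes(fragment, probes)
    print(sorted(signature))
    print(best_window(signature, reference, probes, window=8))
```

A real implementation would work from measured intensities and tolerate false positive and false negative probes; the set-based score here only illustrates the mapping step.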

In an exemplary aspect 4-mers (probes 4 bases in length) are scored through ligation of pairs of probes, for example: N₍₅₋₇₎BBB with BN₍₇₋₉₎, where B is the defined base and N is a degenerate base. For generating signatures on longer DNA concatemer probes, more unique bases will be used. For example, a 25% positive rate in a fragment 1000 bases in length would be achieved by N₍₄₋₆₎BBBB and BBN₍₆₋₈₎. Note that longer fragments need the same number of about 60-80 probes (15-20 ligation cycles using 4 colors). In one aspect all probes of a given length (e.g., 4096 N₂₋₄BBBBBBN₂₋₄) or all ligation pairs may be used to determine complete sequence of the DNA in a concatemer. For example, 1024 combinations of N₍₅₋₇₎B₃ and BBN₍₆₋₈₎ may be scored (256 cycles if 4 colors are used) to determine sequence of DNA fragments of up to about 250 bases, preferably up to about 100 bases.
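
The relationship between fragment length, the number of defined bases, and the positive rate can be checked with a short Python calculation. The model below assumes a random sequence with equally frequent bases and treats overlapping positions as independent, which is an approximation; the function name is hypothetical.

```python
def expected_positive_rate(fragment_length, defined_bases):
    """Approximate probability that a fragment contains at least one full
    match to a probe (or ligated probe pair) with `defined_bases` defined
    positions, assuming random sequence and independent positions."""
    positions = fragment_length - defined_bases + 1
    p_site = 0.25 ** defined_bases
    return 1.0 - (1.0 - p_site) ** positions


if __name__ == "__main__":
    # 4 defined bases (BBB + B) on a 64-base fragment.
    print(round(expected_positive_rate(64, 4), 3))
    # 6 defined bases (BBBB + BB) on a 1000-base fragment.
    print(round(expected_positive_rate(1000, 6), 3))
```

Under these assumptions both cases give a positive rate a little above 20%, close to the linear estimate of about 25% used in the text, which is why adding two defined bases compensates for a roughly sixteen-fold increase in fragment length.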

Decoding or sequencing probes with large numbers of Ns may be prepared from multiple syntheses of subsets of sequences at the degenerate bases to minimize differences in efficiency. Each subset is added to the mix at a proper concentration. Also, some subsets may have more degenerate positions than others. For example, each of the 64 probes from the set N₍₅₋₇₎BBB may be prepared in 4 different syntheses. The first is the regular synthesis, in which all 5-7 bases are fully degenerate; the second is N₍₀₋₃₎(A,T)₅BBB; the third is N₍₀₋₂₎(A,T)(G,C)(A,T)(G,C)(A,T)BBB; and the fourth is N₍₀₋₂₎(G,C)(A,T)(G,C)(A,T)(G,C)BBB.

Oligonucleotide preparations from the three specific syntheses are added to the regular synthesis in experimentally determined amounts to increase hybrid generation with target sequences that have, in front of the BBB sequence, an AT-rich sequence (e.g., AATAT) or an alternating (A or T) and (G or C) sequence (e.g., ACAGT or GAGAC). These sequences are expected to be less efficient in forming a hybrid. All 1024 target sequences can be tested for their efficiency in forming hybrids with N₍₀₋₃₎NNNNNBBB probes, and those types that give the weakest binding may be prepared in about 1-10 additional syntheses and added to the basic probe preparation.

In another exemplary aspect of specific embodiments, 12 bases of a target concatemer are decoded using a combination of hybridization and ligation based assays. In this aspect, one half of the sequence is determined by utilizing the hybridization specificity of short probes and the ligation specificity of fully matched hybrids. Six to ten bases adjacent to the 12 mer are predefined and act as a support for a 6mer to 10-mer oligonucleotide. This short 6mer will ligate at its 3-prime end to one of 4 labeled 6-mers to 10-mers. These decoding probes consist of a pool of 4 oligonucleotides in which each oligonucleotide consists of 4-9 degenerate bases and 1 defined base. This oligonucleotide will also be labeled with one of four fluorescent labels. Each of the 4 possible bases A, C, G, or T will therefore be represented by a fluorescent dye. For example these 5 groups of 4 oligonucleotides and one universal oligonucleotide (Us) can be used in the ligation assays to sequence first 5 bases of 12-mers: δ=each of 4 bases associated with a specific dye or tag at the end:

UUUUUUUU.BNNNNNNN (SEQ ID NO: 5)
UUUUUUUU.NBNNNNNN (SEQ ID NO: 6)
UUUUUUUU.NNBNNNNN (SEQ ID NO: 7)
UUUUUUUU.NNNBNNNN (SEQ ID NO: 8)
UUUUUUUU.NNNNBNNN (SEQ ID NO: 9)

Six or more bases can be sequenced with additional probe pools. To improve discrimination at positions near the center of the 12mer (the 12 bases of the concatemer being sequenced) the 6mer oligonucleotide can be positioned further into the 12mer sequence. This will necessitate the incorporation of degenerate bases into the 3-prime end of the non-labeled oligonucleotide to accommodate the shift. This is an example of decoding probes for position 6 and 7 in the 12-mer.

UUUUUUNN.NNNBNNNN (SEQ ID NO: 8)
UUUUUUNN.NNNNBNNN (SEQ ID NO: 9)

In a similar way the 6 bases from the right side of the 12mer can be decoded by using a fixed oligonucleotide and 5-prime labeled probes. In the above described system 6 cycles are required to define 6 bases of one side of the 12mer. With redundant cycle analysis of bases distant to the ligation site this may increase to 7 or 8 cycles. In total then, complete sequencing of the 12mer could be accomplished with 12-16 cycles of ligation.

In another exemplary aspect, polynucleotide molecules on a random array can be sequenced combining two distinct types of libraries of detector probes. In this approach one library has probes of the general type N₃₋₈B₄₋₆ (anchors) that are ligated with the first 2 or 3 or 4 probes/probe pools from the other set BN₆₋₈, NBN₅₋₇, N₂BN₄₋₆, and N₃BN₃₋₅. In this aspect, a few cycles are used to test a probe from the first library with 2-4 or even more probes from the second library in order to read longer continuous sequences (such as 5-6+3-4=8-10 bases) in just 3-4 cycles. One or more of the probes in one or both libraries can be tagged using physical and chemical design (such as by adding a specific number of bases to provide a distinct hybrid stability, or altering GC content to affect stability), and through labels such as fluorescent labels.

Using multiple colors or other labels allows for parallel and multiplex sequencing of a random array. In one exemplary aspect probes are tagged with different oligonucleotide sequences made of natural bases or new synthetic bases (such as isoG and isoC). Tags can be designed to have very precise binding efficiency with their anti-tags using different oligonucleotide lengths (about 6-24 bases) and/or sequence including GC content. For example 4 different tags may be designed that can be recognized with specific anti-tags in 4 consecutive cycles or in one hybridization cycle followed by a discriminative wash. In the discriminative wash initial signal is reduced to 95-99%, 30-40%, 10-20% and 0-5% for each tag, respectively. In this case by obtaining two images 4 measurements are obtained assuming that probes with different tags will rarely hybridize to the same dot. Another benefit of having many different tags even if they are consecutively decoded (or 2-16 at a time labeled with 2-16 distinct colors) is the ability to use a large number of individually recognizable probes in one assay reaction. This way a 4-64 times longer assay time (that may provide more specific or stronger signal) may be affordable if the probes are decoded in short incubation and removal reactions.

In some aspects, the decoding process requires the use of 48-96 or more decoding probes. These pools will be further combined into 12-24 or more pools by encoding them with four fluorophores, each having different emission spectra. Each array requires about 12-24 cycles to decode. Each cycle consists of a hybridization, wash, array imaging, and strip-off step. These steps, in their respective orders, may take for the above example 5, 2, 12, and 5 minutes each, for a total of 24 minutes each cycle, or roughly 5-10 hours for each array, if the operations were performed linearly. The time to decode each array can be reduced by a factor of two by allowing the system to image constantly. To accomplish this, the imaging of two separate substrates on each microscope is staggered. While one substrate is being reacted, the other substrate is imaged.
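
The timing figures above can be reproduced with a short Python calculation. The step durations are those given in the example; the function names and the assumption of exactly four probes decoded per cycle (one per fluorophore) are illustrative.

```python
# Step durations in minutes, taken from the example above.
STEP_MINUTES = {"hybridization": 5, "wash": 2, "imaging": 12, "strip_off": 5}


def cycle_minutes(steps=STEP_MINUTES):
    """Duration of one decoding cycle when the steps run one after another."""
    return sum(steps.values())


def decode_time(n_probes, n_colors=4, steps=STEP_MINUTES):
    """Return (number of cycles, total hours) for decoding one array."""
    cycles = -(-n_probes // n_colors)  # ceiling division
    return cycles, cycles * cycle_minutes(steps) / 60


def imaging_fraction(steps=STEP_MINUTES):
    """Fraction of each cycle during which the microscope is imaging."""
    return steps["imaging"] / cycle_minutes(steps)


if __name__ == "__main__":
    print(cycle_minutes())        # minutes per cycle
    print(decode_time(48))        # lower end of the probe range
    print(decode_time(96))        # upper end of the probe range
    print(imaging_fraction())     # share of the cycle spent imaging
```

Because imaging occupies half of each cycle in this example, staggering two substrates on one microscope keeps the imaging system continuously busy, which corresponds to the stated factor-of-two reduction in decoding time per array.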

In another exemplary aspect of specific embodiments, a decoding cycle using combinatorial sequencing by hybridization (cSBH) includes the following steps: (i) set temperature of array to hybridization temperature (usually in the range 5-25° C.); (ii) use robot pipetter to pre-mix a small amount of decoding probe with the appropriate amount of hybridization buffer; (iii) pipette mixed reagents into hybridization chamber; (iv) hybridize for predetermined time; (v) drain reagents from chamber using pump (syringe or other); (vi) add a buffer to wash mismatches of non-hybrids; (vii) adjust chamber temperature to appropriate wash temp (about 10-40° C.); (viii) drain chamber; (ix) add more wash buffer if needed to improve imaging; (x) image each array; (xi) remove buffer; and (xii) start the next hybridization cycle with the next decoding probe pool in set.

In one aspect, polynucleotide molecules amplified using NASBA and TMA methods can be directly detected when the newly synthesized strands comprise detectable labels, either by incorporation into the primers or by incorporation of modified labeled nucleotides into the growing strand. Alternatively, indirect detection of unlabelled strands (which now serve as “targets” in the detection mode) can occur using a variety of sandwich assay configurations. As will be appreciated by those in the art, any of the newly synthesized strands can serve as the “target” to form an assay complex on a surface with a capture probe. In NASBA and TMA, it is preferable to utilize the newly formed RNA strands as the target, as this is where significant amplification occurs.

In another aspect, Invader™ technology is used to detect and identify nucleotide sequence. This technology is based on structure-specific polymerases that cleave nucleic acids in a site-specific manner. Two probes are used: an “invader” probe and a “signaling” probe that adjacently hybridize to a target sequence with a non-complementary overlap. The enzyme cleaves at the overlap due to its recognition of the “tail”, and releases the “tail” with a label. This can then be detected. The Invader™ technology is described in U.S. Pat. Nos. 5,846,717; 5,614,402; 5,719,028; 5,541,311; and 5,843,669, all of which are hereby incorporated by reference.

In another aspect, products from an oligonucleotide ligation amplification (OLA) technique are detected in order to identify a nucleotide sequence of a polynucleotide molecule. As will be appreciated by those in the art, the ligation product can be detected in a variety of ways. In a preferred aspect of specific embodiments, the ligation reaction is run in solution. In this aspect, only one of the primers carries a detectable label, e.g., the first ligation probe, and the capture probe on the bead is substantially complementary to the other probe, e.g., the second ligation probe. In this way, unextended labeled ligation primers will not interfere with the assay. That is, in a preferred aspect of specific embodiments, the ligation product is detected by solid-phase oligonucleotide probes. The solid-phase probes are preferably complementary to at least a portion of the ligation product. In a preferred aspect, the solid-phase probe is complementary to the 5′ detection oligonucleotide portion of the ligation product. This substantially reduces or eliminates false signal generated by the optically-labeled 3′ primers. Preferably, detection is accomplished by removing the unligated 5′ detection oligonucleotide from the reaction before application to a capture probe. In one aspect, the unligated 5′ detection oligonucleotides are removed by digesting 3′ non-protected oligonucleotides with a 3′ exonuclease, such as, exonuclease I. The ligation products are protected from exo I digestion by including, for example, 4-phosphorothioate residues at their 3′ terminus, thereby, rendering them resistant to exonuclease digestion. The unligated detection oligonucleotides are not protected and are digested. Alternatively, the target nucleic acid is immobilized on a solid-phase surface and a ligation assay is performed and unligated oligonucleotides are removed by washing under appropriate stringency to remove unligated oligonucleotides. 
The ligated oligonucleotides are eluted from the target nucleic acid using denaturing conditions, such as, 0.1 N NaOH, and detected as described herein.

The detection of products from an LCR reaction can also occur directly, in the case where one or both of the primers comprises at least one detectable label, or indirectly, using sandwich assays, through the use of additional probes; that is, the ligated probes can serve as target sequences, and detection may utilize amplification probes, capture probes, capture extender probes, label probes, and label extender probes, etc.

In one aspect, if an invasive cleavage reaction is used to amplify polynucleotide molecules, the products of the reaction can be detected by designing the probes to utilize a fluorophore-quencher reaction. A signaling probe comprising both a fluorophore and a quencher is used, with the fluorophore and the quencher on opposite sides of the cleavage site. As will be appreciated by those in the art, these will be positioned closely together. Thus, in the absence of cleavage, very little signal is seen due to the quenching reaction. After cleavage, however, the distance between the two is large, and thus fluorescence can be detected. Upon assembly of an assay complex, comprising the target sequence, an invader probe, and a signaling probe, and the introduction of the cleavage enzyme, the cleavage of the complex results in the disassociation of the quencher from the complex, resulting in an increase in fluorescence. In this aspect, suitable fluorophore-quencher pairs are as known in the art. For example, suitable quencher molecules comprise DABCYL.

In a preferred aspect of specific embodiments, straight hybridization methods are used to elucidate the identity of the base at the detection position. Generally speaking, these techniques break down into two basic types of reactions: those that rely on competitive hybridization techniques, and those that discriminate using stringency parameters and combinations thereof.

In one aspect of specific embodiments, the use of competitive hybridization probes is done to elucidate either the identity of the nucleotide(s) at the detection position or the presence of a mismatch. For example, sequencing by hybridization has been described (Drmanac et al., Genomics 4:114 (1989); Koster et al., Nature Biotechnology 14:1123 (1996); U.S. Pat. Nos. 5,525,464; 5,202,231 and 5,695,940, among others, all of which are hereby expressly incorporated by reference in their entirety).

In one aspect of specific embodiments, a plurality of probes (sometimes referred to herein as “readout probes”) are used to identify the base at the detection position. In this aspect, each different readout probe comprises a different detection label (which, as outlined below, can be either a primary label or a secondary label) and a different base at the position that will hybridize to the detection position of the target sequence (herein referred to as the readout position) such that differential hybridization will occur. That is, all other parameters being equal, a perfectly complementary readout probe (a “match probe”) will in general be more stable and slower to dissociate than a probe comprising a mismatch (a “mismatch probe”) at any particular temperature. Accordingly, by using different readout probes, each with a different base at the readout position and each with a different label, the identification of the base at the detection position is elucidated. In a preferred aspect of specific embodiments, a set of readout probes is used, each comprising a different base at the readout position. In some aspects, each readout probe comprises a different label that is distinguishable from the others. In one aspect, the length and sequence of each readout probe is identical except for the readout position, although this need not be true in all embodiments.
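
The way a set of differently labeled readout probes identifies the base at the detection position can be sketched in Python as follows. The sketch assumes one intensity measurement per label after a stringency wash, and it calls a base only when the strongest signal exceeds the runner-up by a chosen ratio. The `min_ratio` threshold and the function name are hypothetical and would have to be set experimentally.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_detection_base(signals, min_ratio=2.0):
    """Identify the base at the detection position.

    `signals` maps the base at the readout position of each readout probe
    to the intensity measured for that probe's label. The match probe is
    expected to give the strongest signal; the detection position is the
    complement of its readout base. Returns None when no call is made.
    """
    ranked = sorted(signals.items(), key=lambda item: item[1], reverse=True)
    (best_base, best), (_, runner_up) = ranked[0], ranked[1]
    if best <= 0:
        return None  # no probe gave a signal
    if runner_up > 0 and best / runner_up < min_ratio:
        return None  # match and mismatch probes not clearly separated
    return COMPLEMENT[best_base]


if __name__ == "__main__":
    print(call_detection_base({"A": 950, "C": 80, "G": 60, "T": 110}))
    print(call_detection_base({"A": 500, "C": 450, "G": 10, "T": 5}))
```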

Label Probes

As described above, in one aspect, an adaptor can comprise one or more binding sequences for a detectable tag, such as a label probe. In some aspects, label probes can be added to the concatemers to detect particular sequences. Label probes will hybridize to the label probe binding sequence and comprise at least one detectable label. Such labels include without limitation the direct or indirect attachment of radioactive moieties, fluorescent moieties, colorimetric moieties, chemiluminescent moieties, and the like.

In one aspect, one or more fluorescent dyes are used as labels for the label probes (also referred to herein as “oligonucleotide probes”), e.g., as disclosed by Menchen et al, U.S. Pat. No. 5,188,934 (4,7-dichlorofluorescein dyes); Bergot et al, U.S. Pat. No. 5,366,860 (spectrally resolvable rhodamine dyes); Lee et al, U.S. Pat. No. 5,847,162 (4,7-dichlororhodamine dyes); Khanna et al, U.S. Pat. No. 4,318,846 (ether-substituted fluorescein dyes); Lee et al, U.S. Pat. No. 5,800,996 (energy transfer dyes); Lee et al, U.S. Pat. No. 5,066,580 (xanthene dyes); Mathies et al, U.S. Pat. No. 5,688,648 (energy transfer dyes); and the like. Labeling can also be carried out with quantum dots, as disclosed in the following patents and patent publications, incorporated herein by reference: U.S. Pat. Nos. 6,322,901; 6,576,291; 6,423,551; 6,251,303; 6,319,426; 6,426,513; 6,444,143; 5,990,479; 6,207,392; 2002/0045045; 2003/0017264; and the like. As used herein, the term “fluorescent signal generating moiety” means a signaling means which conveys information through the fluorescent absorption and/or emission properties of one or more molecules. Such fluorescent properties include fluorescence intensity, fluorescence life time, emission spectrum characteristics, energy transfer, and the like.

Commercially available fluorescent nucleotide analogues readily incorporated into label probes include, for example, Cy3-dCTP, Cy3-dUTP, Cy5-dCTP, Cy5-dUTP (Amersham Biosciences, Piscataway, N.J., USA), fluorescein-12-dUTP, tetramethylrhodamine-6-dUTP, Texas Red®-5-dUTP, Cascade Blue®-7-dUTP, BODIPY® FL-14-dUTP, BODIPY® TMR-14-dUTP, BODIPY® TR-14-dUTP, Rhodamine Green™-5-dUTP, Oregon Green® 488-5-dUTP, Texas Red®-12-dUTP, BODIPY® 630/650-14-dUTP, BODIPY® 650/665-14-dUTP, Alexa Fluor® 488-5-dUTP, Alexa Fluor® 532-5-dUTP, Alexa Fluor® 568-5-dUTP, Alexa Fluor® 594-5-dUTP, Alexa Fluor® 546-14-dUTP, fluorescein-12-UTP, tetramethylrhodamine-6-UTP, Texas Red®-5-UTP, Cascade Blue®-7-UTP, BODIPY® FL-14-UTP, BODIPY® TMR-14-UTP, BODIPY® TR-14-UTP, Rhodamine Green™-5-UTP, Alexa Fluor® 488-5-UTP, Alexa Fluor® 546-14-UTP (Molecular Probes, Inc. Eugene, Oreg., USA). Other fluorophores available for post-synthetic attachment include, inter alia, Alexa Fluor® 350, Alexa Fluor® 532, Alexa Fluor® 546, Alexa Fluor® 568, Alexa Fluor® 594, Alexa Fluor® 647, BODIPY 493/503, BODIPY FL, BODIPY R6G, BODIPY 530/550, BODIPY TMR, BODIPY 558/568, BODIPY 564/570, BODIPY 576/589, BODIPY 581/591, BODIPY 630/650, BODIPY 650/665, Cascade Blue, Cascade Yellow, Dansyl, lissamine rhodamine B, Marina Blue, Oregon Green 488, Oregon Green 514, Pacific Blue, rhodamine 6G, rhodamine green, rhodamine red, tetramethylrhodamine, Texas Red (available from Molecular Probes, Inc., Eugene, Oreg., USA), and Cy2, Cy3.5, Cy5.5, and Cy7 (Amersham Biosciences, Piscataway, N.J. USA, and others). FRET tandem fluorophores may also be used, such as PerCP-Cy5.5, PE-Cy5, PE-Cy5.5, PE-Cy7, PE-Texas Red, and APC-Cy7; also, PE-Alexa dyes (610, 647, 680) and APC-Alexa dyes. 
Biotin, or a derivative thereof, may also be used as a label on a detection oligonucleotide, and subsequently bound by a detectably labeled avidin/streptavidin derivative (e.g., phycoerythrin-conjugated streptavidin), or a detectably labeled anti-biotin antibody. Digoxigenin may be incorporated as a label and subsequently bound by a detectably labeled anti-digoxigenin antibody (e.g., fluoresceinated anti-digoxigenin). An aminoallyl-dUTP residue may be incorporated into a detection oligonucleotide and subsequently coupled to an N-hydroxy succinimide (NHS) derivatized fluorescent dye, such as those listed supra. In general, any member of a conjugate pair may be incorporated into a detection oligonucleotide provided that a detectably labeled conjugate partner can be bound to permit detection. As used herein, the term antibody refers to an antibody molecule of any class, or any subfragment thereof, such as an Fab. Other suitable labels for detection oligonucleotides may include fluorescein (FAM), digoxigenin, dinitrophenol (DNP), dansyl, biotin, bromodeoxyuridine (BrdU), hexahistidine (6×His), phospho-amino acids (e.g., P-tyr, P-ser, P-thr), or any other suitable label. In one aspect the following hapten/antibody pairs are used for detection, in which each of the antibodies is derivatized with a detectable label: biotin/α-biotin, digoxigenin/α-digoxigenin, dinitrophenol (DNP)/α-DNP, 5-Carboxyfluorescein (FAM)/α-FAM. As described in schemes below, probes may also be indirectly labeled, especially with a hapten that is then bound by a capture agent, e.g., as disclosed in Holtke et al, U.S. Pat. Nos. 5,344,757; 5,702,888; and 5,354,657; Huber et al, U.S. Pat. No. 5,198,537; Miyoshi, U.S. Pat. No. 4,849,336; Misiura and Gait, PCT publication WO 91/17160; and the like. Many different hapten-capture agent pairs are available for use with the invention. Exemplary haptens include biotin, des-biotin and other derivatives, dinitrophenol, dansyl, fluorescein, CY5 and other dyes, digoxigenin, and the like. For biotin, a capture agent may be avidin, streptavidin, or antibodies. Antibodies may be used as capture agents for the other haptens (many dye-antibody pairs being commercially available, e.g., Molecular Probes).

In one aspect, pools of label probes are provided which preferably have from about 1 to about 3 bases, allowing for an even and optimized signal for different sequences at degenerate positions. In another aspect, a concentration adjusted mix of 3-mer building blocks is used in the probe synthesis.

Label probes may be prepared with nucleic acid tag tails instead of being directly labeled. Tails preferably do not interact with target polynucleotides. These tails may be prepared from natural bases or modified bases such as isoC and isoG that pair only between themselves. If isoC and isoG nucleotides are used, the sequences may be separately synthesized with a 5′ amino-linker, which allows conjugation to a 5′ carboxy modified linker that is synthesized on to each tagged probe. This allows separately synthesized tag sequences to be combined with known probes while they are still attached to the column. In one aspect, 21 tagged sequences are used in combination with 1024 known probes.

The tails may be separated from probes by 1-3 or more degenerate bases, abasic sites or other linkers. One approach to minimize interaction of tails and target DNA is to use sequences that are very infrequent in the target DNA. For example, CGCGATATCGCGATAT (SEQ ID NO: 10) or CGATCGATCGAT (SEQ ID NO: 11) is expected to be infrequent in mammalian genomes. One option is to use probes with tails pre-hybridized with unlabeled tags that would be denatured and optionally washed away after ligation and before hybridization with labeled tags. Uracil may be used to generate degradable tails/tags and to remove them before running a new cycle instead of using temperature removal.

In one aspect, high-plex multiplex ligation assays use probes that are not labeled with fluorescent dyes, thus reducing background and assay costs. For example, for 8 colors, 4×8=32 different encoding tails may be prepared and 32 probes may be used as a pool in hybridization/ligation. In the decoding process, four cycles, each with 8 tags, are used. Thus, each color is used for 4 tags across the 4 decoding cycles. After each cycle, tags may be removed or dyes photobleached. The process requires that the last set of probes to be decoded stay hybridized through all 4 decoding cycles.
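
By way of illustration only, the 4×8 encoding described above may be sketched as a lookup between encoding tails and (decoding cycle, color) pairs. The function names, the zero-based numbering of tails, cycles and colors, and the cycle-major ordering are assumptions made for this sketch and are not part of the described method.

```python
from itertools import product

def build_decoding_schedule(n_colors=8, n_cycles=4):
    """Give each encoding tail a unique (cycle, color) pair.

    Hypothetical numbering: tails are numbered cycle by cycle,
    so tail_id = cycle * n_colors + color.
    """
    pairs = product(range(n_cycles), range(n_colors))
    return {tail_id: pair for tail_id, pair in enumerate(pairs)}

def decode(observations, schedule):
    """Map observed (cycle, color) signals at one spot back to tail ids."""
    inverse = {pair: tail_id for tail_id, pair in schedule.items()}
    return sorted(inverse[obs] for obs in observations if obs in inverse)

schedule = build_decoding_schedule()
print(len(schedule))                 # number of distinct encoding tails
print(decode({(2, 5)}, schedule))    # signal in cycle 2, color 5
```

With 8 colors and 4 cycles the schedule holds 32 tails, and each color appears once per cycle, i.e., 4 times in total.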

In one aspect, additional properties are included to provide the ability to distinguish different probes using the same color, for example Tm/stability, degradability by incorporated uracil bases and UDG enzyme, and chemically or photochemically cleavable bonds. A combination of two properties, such as temperature stability measured directly or after cutting or removing a stabilizer, may provide 8 distinct tags for the same color. More than one cut type may be used to create 3 or more groups. To execute this, 4-8 or 6-12 exposures of the same color may be required, demanding low photo-bleaching conditions such as low intensity light illumination that may be detected by intensified CCDs (ICCDs). For example, if one property is melting temperature (Tm) and there are 4 tag-oligos or anchors or primers with distinct Tm, another set of 4 oligos can be prepared that has the first 4 probes connected to, or able to interact with, a stabilizer that shifts the Tm of these 4 oligos above that of the most stable oligo in the first group without stabilizer. After resolving 4 oligos from the first group by consecutive melting off, the temperature may be reduced to the initial low level, the stabilizer may be cut or removed, and 4 tagged-oligos or anchors or primers can then be differentially melted using the same temperature points as for the first group.
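
The two-property scheme in the example above may be illustrated with a simplified simulation. The Tm values, the temperature steps, the size of the stabilizer shift, and the rule that an oligo melts off as soon as the temperature reaches its effective Tm are all hypothetical simplifications chosen for this sketch.

```python
def resolve_by_melting(tags, melt_points, stabilizer_shift):
    """Assign each tag the (phase, temperature) at which it melts off.

    tags: name -> (base_tm, has_stabilizer)
    Phase 1 is run with stabilizers in place; phase 2 is run after the
    temperature is lowered again and the stabilizer is cut or removed.
    """
    calls = {}
    bound = dict(tags)
    for phase in (1, 2):
        for temp in sorted(melt_points):
            for name, (base_tm, stabilized) in list(bound.items()):
                shift = stabilizer_shift if (stabilized and phase == 1) else 0
                if base_tm + shift <= temp:
                    calls[name] = (phase, temp)
                    del bound[name]
    return calls

base_tms = [40, 50, 60, 70]       # hypothetical Tm values, degrees C
melt_points = [45, 55, 65, 75]    # hypothetical temperature steps
tags = {}
for i, tm in enumerate(base_tms, start=1):
    tags[f"plain_{i}"] = (tm, False)
    tags[f"stabilized_{i}"] = (tm, True)

calls = resolve_by_melting(tags, melt_points, stabilizer_shift=40)
for name in sorted(calls):
    print(name, calls[name])
```

In this sketch the four unstabilized tags are resolved in the first melting series and the four stabilized tags in the second, giving 8 distinguishable (phase, temperature) outcomes for a single color.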

In one aspect, probe-probe hybrids are stabilized through ligation to another unlabeled oligonucleotide, such as an anchor probe.

As mentioned above, random arrays of biomolecules, such as genomic DNA fragments or cDNA fragments, provide a platform for large scale sequence determination and for genome-wide measurements based on counting sequence tags, in a manner similar to measurements made by serial analysis of gene expression (SAGE) or massively parallel signature sequencing, e.g., Velculescu, et al, (1995), Science 270, 484-487; and Brenner et al (2000), Nature Biotechnology, 18: 630-634. Such genome-wide measurements include, but are not limited to, determination of polymorphisms, including nucleotide substitutions, deletions, and insertions, inversions, and the like, determination of methylation patterns, copy number patterns, and the like, such as could be carried out by a wide range of assays known to those with ordinary skill in the art, e.g., Syvanen (2005), Nature Genetics Supplement, 37: S5-S10; Gunderson et al (2005), Nature Genetics, 37: 549-554; Fan et al (2003), Cold Spring Harbor Symposia on Quantitative Biology, LXVIII: 69-78; and U.S. Pat. Nos. 4,883,750; 6,858,412; 5,871,921; 6,355,431; and the like, which are incorporated herein by reference.

Detection Instrumentation

As mentioned above, signals from single molecules on random arrays made in accordance with the invention are generated and detected by a number of detection systems, including, but not limited to, scanning electron microscopy, near field scanning optical microscopy (NSOM), total internal reflection fluorescence microscopy (TIRFM), and the like. Abundant guidance is found in the literature for applying such techniques for analyzing and detecting nanoscale structures on surfaces, as evidenced by the following references that are incorporated by reference: Reimer et al, editors, Scanning Electron Microscopy: Physics of Image Formation and Microanalysis, 2nd Edition (Springer, 1998); Nie et al, Anal. Chem., 78: 1528-1534 (2006); Hecht et al, Journal Chemical Physics, 112: 7761-7774 (2000); Zhu et al, editors, Near-Field Optics: Principles and Applications (World Scientific Publishing, Singapore, 1999); Drmanac, International patent publication WO 2004/076683; Lehr et al, Anal. Chem., 75: 2414-2420 (2003); Neuschafer et al, Biosensors & Bioelectronics, 18: 489-497 (2003); Neuschafer et al, U.S. Pat. No. 6,289,144; and the like. Of particular interest is TIRFM, for example, as disclosed by Neuschafer et al, U.S. Pat. No. 6,289,144; Lehr et al (cited above); and Drmanac, International patent publication WO 2004/076683.

In one aspect, instruments for use with arrays of the invention comprise three basic components: (i) a fluidics system for storing and transferring detection and processing reagents, e.g., probes, wash solutions, and the like, to an array; (ii) a reaction chamber, or flow cell, holding or comprising an array and having flow-through and temperature control capability; and (iii) an illumination and detection system. In one aspect, a flow cell has a temperature control subsystem with ability to maintain temperature in the range from about 5-95° C., or more specifically 10-85° C., and can change temperature with a rate of about 0.5-2° C. per second.

In an exemplary aspect of specific embodiments, a 20× objective is used, and a 6 mm×6 mm array may require roughly 30 images for full coverage by using a 10 megapixel camera. Each 1 micrometer array area is read by about 8 pixels. Each image is acquired in 250 ms: 150 ms for exposure and 100 ms to move the stage. Using this fast acquisition it will take ˜7.5 seconds to image each array, or 12 minutes to image the complete set of 96 arrays on each substrate. In one aspect of an imaging system, this high image acquisition rate is achieved by using four ten-megapixel cameras, each imaging the emission spectra of a different fluorophore. The cameras are coupled to the microscope through a series of dichroic beam splitters. The autofocus routine, which takes extra time, runs only if an acquired image is out of focus. It will then store the Z axis position information to be used upon return to that section of that array during the next imaging cycle. By mapping the autofocus position for each location on the substrate, it is possible to reduce the time required for image acquisition. Imaging speed may be improved by decreasing the objective magnification power, using grid patterned arrays and increasing the number of pixels of data collected in each image.
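
The timing figures in the preceding paragraph follow from simple arithmetic, which may be sketched as follows. The function and parameter names are chosen for this illustration only; the default values are those stated above.

```python
def array_imaging_time(images_per_array=30, exposure_ms=150,
                       stage_move_ms=100, arrays_per_substrate=96):
    """Return (seconds per array, minutes per substrate)."""
    per_image_s = (exposure_ms + stage_move_ms) / 1000
    per_array_s = images_per_array * per_image_s
    per_substrate_min = per_array_s * arrays_per_substrate / 60
    return per_array_s, per_substrate_min

seconds_per_array, minutes_per_substrate = array_imaging_time()
print(seconds_per_array, minutes_per_substrate)  # 7.5 12.0
```

The sketch does not include any time spent on autofocus, so it represents the case in which every image is acquired in focus.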

For example, up to four or more cameras may be used, preferably in the 10-16 megapixel range. Multiple band pass filters and dichroic mirrors may also be used to collect pixel data across up to four or more emission spectra. To compensate for the lower light collecting power of the decreased magnification objective, the power of the excitation light source can be increased. Throughput can be increased by using one or more flow chambers with each camera, so that the imaging system is not idle while the samples are being hybridized/reacted. Because the probing of arrays can be non-sequential, more than one imaging system can be used to collect data from a set of arrays, further decreasing assay time.

During the imaging process, the substrate must remain in focus. Some key factors in maintaining focus are the flatness of the substrate, orthogonality of the substrate to the focus plane, and mechanical forces on the substrate that may deform it. Substrate flatness can be well controlled; glass plates which have better than ¼ wave flatness are readily obtained. Uneven mechanical forces on the substrate can be minimized through proper design of the hybridization chamber. Orthogonality to the focus plane can be achieved by a well adjusted, high precision stage. Auto focus routines generally take additional time to run, so it is desirable to run them only if necessary. After each image is acquired, it is analyzed using a fast algorithm to determine if the image is in focus. If the image is out of focus, the auto focus routine runs and then stores the objective's Z position information to be used upon return to that section of that array during the next imaging cycle. By mapping the objective's Z position at various locations on the substrate, the time required for substrate image acquisition is reduced.
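
One possible way to organize the run-autofocus-only-when-needed logic described above is sketched below. The class name, the callbacks standing in for the focus check, the autofocus routine and the camera, and the re-acquisition of an image after autofocus are all assumptions made for illustration; the stage is simulated rather than driven by real hardware.

```python
class FocusMap:
    """Caches the objective's Z position for each array location."""

    def __init__(self, is_in_focus, run_autofocus, default_z=0.0):
        self.is_in_focus = is_in_focus      # hypothetical: image -> bool
        self.run_autofocus = run_autofocus  # hypothetical: location -> z
        self.default_z = default_z
        self.z_by_location = {}
        self.autofocus_runs = 0

    def acquire(self, location, capture):
        # Start from the Z position stored during an earlier cycle, if any.
        z = self.z_by_location.get(location, self.default_z)
        image = capture(location, z)
        if not self.is_in_focus(image):
            # Slow path: run autofocus and re-acquire (an assumption here).
            z = self.run_autofocus(location)
            self.autofocus_runs += 1
            image = capture(location, z)
        self.z_by_location[location] = z
        return image

# Simulated substrate: the "image" is simply the focus error at that Z.
true_z = {(0, 0): 0.0, (0, 1): 2.0, (1, 0): -1.5}

def capture(location, z):
    return abs(z - true_z[location])

focus_map = FocusMap(is_in_focus=lambda err: err < 0.5,
                     run_autofocus=lambda loc: true_z[loc])
for cycle in range(2):
    for location in true_z:
        focus_map.acquire(location, capture)
print(focus_map.autofocus_runs)  # 2
```

In this simulation the autofocus routine runs twice during the first imaging cycle and not at all during the second, because the stored Z positions are reused.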

A suitable illumination and detection system for fluorescence-based signal is a Zeiss Axiovert 200 equipped with a TIRF slider coupled to an 80 milliwatt 532 nm solid state laser. The slider illuminates the substrate through the objective at the correct TIRF illumination angle. TIRF can also be accomplished without the use of the objective by illuminating the substrate through a prism optically coupled to the substrate. Planar wave guides can also be used to implement TIRF on the substrate. Epi illumination can also be employed. The light source can be rastered, spread beam, coherent, incoherent, and originate from a single or multi-spectrum source.

One aspect of the imaging system contains a 20× lens with a 1.25 mm field of view, with detection being accomplished with a 10 megapixel camera. Such a system images approximately 1.5 million concatemers attached to the patterned array at 1 micron pitch. Under this configuration there are approximately 6.4 pixels per concatemer. The number of pixels per concatemer can be adjusted by increasing or decreasing the field of view of the objective. For example, a 1 mm field of view would yield a value of 10 pixels per concatemer and a 2 mm field of view would yield a value of 2.5 pixels per concatemer. The field of view may be adjusted relative to the magnification and NA of the objective to yield the lowest pixel count per concatemer that is still capable of being resolved by the optics and image analysis software.
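
The pixel counts quoted above can be reproduced with a short calculation. The sketch assumes a square field of view fully covered by a square grid of concatemers, which is a simplification consistent with the stated figures; the function name is hypothetical.

```python
def pixels_per_concatemer(field_of_view_mm, pitch_um=1.0,
                          camera_pixels=10_000_000):
    """Camera pixels available per concatemer for a square field of view."""
    spots_per_side = field_of_view_mm * 1000 / pitch_um
    concatemers_in_view = spots_per_side ** 2
    return camera_pixels / concatemers_in_view

for fov_mm in (1.25, 1.0, 2.0):
    print(fov_mm, round(pixels_per_concatemer(fov_mm), 2))
```

For a 1.25 mm field of view the sketch gives about 1.56 million concatemers and 6.4 pixels per concatemer; fields of view of 1 mm and 2 mm give 10 and 2.5 pixels per concatemer, respectively.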

Both TIRF and epi illumination allow for almost any light source to be used. One illumination schema is to share a common set of monochromatic illumination sources (about 4 lasers for 6-8 colors) amongst imagers. Each imager collects data at a different wavelength at any given time and the light sources would be switched to the imagers via an optical switching system. In such an aspect, the illumination source preferably produces at least 6, but more preferably 8 different wavelengths. Such sources include gas lasers, multiple diode pumped solid state lasers combined through a fiber coupler, filtered Xenon Arc lamps, tunable lasers, or the more novel Spectralum Light Engine, soon to be offered by Tidal Photonics. The Spectralum Light Engine uses a prism to spectrally separate light. The spectrum is projected onto a Texas Instruments Digital Light Processor, which can selectively reflect any portion of the spectrum into a fiber or optical connector. This system is capable of monitoring and calibrating the power output across individual wavelengths to keep them constant so as to automatically compensate for intensity differences as bulbs age or between bulb changes.

Successfully scoring 6 billion concatemers through ˜350 (˜60 per color) images per region over 24 hours may require a combination of parallel image acquisition, increased image acquisition speed, and increased field of view for each imager. Additionally, the imager may support from six to eight colors. Commercially available microscopes commonly image a ˜1 mm field of view at 20× magnification with an NA of 0.8. At the proposed concatemer pitch of 0.5 micron, this translates into roughly 4 million concatemers per image. This yields approximately 1,500 images for 6 billion spots per hybridization cycle, or 0.5 million images for 350 imaging cycles. In a large scale sequencing operation, each imager preferably acquires ˜200,000 images per day, based on a 300 millisecond exposure time to a 16 megapixel CCD. Thus, a preferred instrument design is 4 imager modules each serving 4 flow cells (16 flow cells total). The above described imaging schema assumes that each imager has a CCD detector with 10 million pixels and is used with an exposure time of roughly 300 milliseconds. This should be an acceptable method for collecting data for 6 fluorophore labels. One possible drawback to this imaging technique is that certain fluorophores may be unintentionally photobleached by the light source while other fluorophores are being imaged. Keeping the illumination power low and exposure times to a minimum would greatly reduce photobleaching. By using intensified CCDs (ICCDs), data of roughly the same quality could be collected with illumination intensities and exposure times that are orders of magnitude lower than those required for standard CCDs. ICCDs are generally available in the 1-1.4 megapixel range. Because they require much shorter exposure times, a one megapixel ICCD can acquire ten or more images in the time a standard CCD acquires a single image. Used in conjunction with fast filter wheels and a high speed flow cell stage, a one megapixel ICCD should be able to collect the same amount of data as a 10 megapixel standard CCD.
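
The image counts given above may be checked with the following sketch. It assumes a square field of view fully tiled with concatemers at the stated pitch; the function name, parameter names and the notion of "imager-days" are introduced here only for illustration.

```python
def imaging_budget(total_concatemers=6_000_000_000, field_of_view_mm=1.0,
                   pitch_um=0.5, imaging_cycles=350,
                   images_per_imager_per_day=200_000):
    """Estimate image counts for a full run under the stated assumptions."""
    spots_per_side = field_of_view_mm * 1000 / pitch_um
    concatemers_per_image = spots_per_side ** 2
    images_per_cycle = total_concatemers / concatemers_per_image
    total_images = images_per_cycle * imaging_cycles
    imager_days = total_images / images_per_imager_per_day
    return concatemers_per_image, images_per_cycle, total_images, imager_days

per_image, per_cycle, total, imager_days = imaging_budget()
print(per_image, per_cycle, total, imager_days)
```

Under these assumptions the sketch gives 4 million concatemers per image, 1,500 images per cycle and 525,000 images in total, or roughly 2.6 imager-days at ˜200,000 images per imager per day, which is within the daily capacity of four imager modules.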

Kits of the Invention

In the commercialization of the methods and compositions described herein, certain kits for construction of deletion mate pairs, of deletion mate pair constructs, and of random arrays of deletion mate pair constructs or deletion mate pair construct amplicons are provided as kits of the invention. Kits for using deletion mate pair constructs, for creating deletion mate pair construct amplicons, and for using the same for various applications are particularly useful. In general, kits of the invention can include any deletion mate pair, deletion mate pair construct, amplicon of deletion mate pair construct, and random arrays as described herein, as well as reagents and molecules for creating such constructs and arrays.

In one aspect, kits of the invention include elements for selecting for desired orientations of multiple adaptors in library constructs. Such kits can include without limitation the following elements: (a) a first adaptor, which includes a recognition site for a first Type IIs restriction endonuclease; (b) a second adaptor, which includes a recognition site for a second Type IIs restriction endonuclease; and (c) primers complementary to both ends of each of the first and second adaptors. In one embodiment, the adaptors included in such kits are single stranded. In another embodiment, the adaptors are double stranded. Such kits may also include combinations of single and double stranded adaptors.

In another aspect, the invention provides kits for the construction of deletion mate pair constructs. Such kits can include without limitation: (i) a plurality of circularization adaptors; (ii) a plurality of deletion adaptors, wherein the deletion adaptors comprise at least one recognition site for a restriction endonuclease; (iii) a plurality of restriction endonucleases which correspond to the recognition sites of the deletion adaptors; and (iv) ligases and buffers and reagents for utilizing the ligases. The circularization and deletion adaptors may be double- or single-stranded. In a preferred embodiment, the deletion adaptors comprise recognition sites for exact cutters, and the restriction endonucleases in the kit include such exact cutters. In another embodiment, the plurality of deletion adaptors comprise recognition sites for both exact and non-exact cutters, and the plurality of restriction endonucleases include a combination of exact and non-exact cutters.

In a further embodiment, the kit described above also includes reagents for creating amplicons of the deletion mate pairs created using the methods described herein and/or the kit described above. In one embodiment, the kit includes reagents for conducting a rolling circle replication reaction, and the resultant amplicons are concatemers.

Kits for applications of random arrays of the invention include, but are not limited to, kits for determining the nucleotide sequence of a target polynucleotide, kits for large-scale identification of differences between reference DNA sequences and test DNA sequences, kits for profiling exons, and the like. A kit typically comprises at least one support having a surface and one or more reagents necessary or useful for constructing a random array of the invention or for carrying out an application therewith. Such reagents include, without limitation, nucleic acid primers, probes, adaptors, enzymes, and the like, and are each packaged in a container, such as, without limitation, a vial, tube or bottle, in a package suitable for commercial distribution, such as, without limitation, a box, a sealed pouch, a blister pack and a carton. The package typically contains a label or packaging insert indicating the uses of the packaged materials. As used herein, “packaging materials” includes any article used in the packaging for distribution of reagents in a kit, including without limitation containers, vials, tubes, bottles, pouches, blister packaging, labels, tags, instruction sheets and package inserts.

In one aspect, the invention provides a kit for making a random array of concatemers of DNA fragments from a source nucleic acid comprising the following components: (i) a support having a surface; and (ii) at least one adaptor for ligating to each DNA fragment and forming a DNA circle therewith, each DNA circle capable of being replicated by a rolling circle replication reaction to form a concatemer that is capable of being randomly disposed on the surface. In such kits, the surface may be a planar surface having an array of discrete regions, wherein each discrete region has a size equivalent to that of said concatemers. The discrete regions may form a regular array with a nearest neighbor distance in the range of from 0.1 to 20 μm. The concatemers on the discrete regions may have a nearest neighbor distance such that they are optically resolvable. The discrete regions may have capture probes attached and the adaptors may each have a region complementary to the capture oligonucleotides such that the concatemers are capable of being attached to the discrete regions by formation of complexes between the capture oligonucleotides and the complementary regions of the adaptor oligonucleotides. In some aspects, the concatemers are randomly distributed on said discrete regions and the nearest neighbor distance is in the range of from 0.3 to 3 μm.

Such kits may further comprise (a) a terminal transferase for attaching a homopolymer tail to said DNA fragments to provide a binding site for a first end of said adaptors, (b) a ligase for ligating a strand of said adaptor oligonucleotide to ends of said DNA fragment to form said DNA circle, (c) a primer for annealing to a region of the strand of said adaptors, and (d) a DNA polymerase for extending the primer annealed to the strand in a rolling circle replication reaction. The above adaptor oligonucleotide may have a second end having a number of degenerate bases in the range of from 4 to 12.

In still another aspect, the invention provides kits for constructing a single molecule array comprising the following components: (i) a support having a surface having reactive functionalities; and (ii) a plurality of macromolecular structures each having a unique functionality and multiple complementary functionalities, the macromolecular structures being capable of being attached randomly on the surface wherein the attachment is formed by one or more linkages formed by reaction of one or more reactive functionalities with one or more complementary functionalities; and wherein the unique functionality is capable of selectively reacting with a functionality on an analyte molecule to form the single molecule array. In a preferred aspect, the macromolecular structures comprise deletion mate pairs, deletion mate pair constructs, and/or amplicons of deletion mate pair constructs. In some aspects of such kits, the surface is a planar surface having an array of discrete regions containing said reactive functionalities and wherein each discrete region has an area less than 1 μm². In further aspects, the discrete regions form a regular array with a nearest neighbor distance in the range of from 0.1 to 20 μm. In further aspects, the concatemers on the discrete regions have a nearest neighbor distance such that they are optically resolvable. In still further aspects, the macromolecular structures may be concatemers of one or more DNA fragments and wherein the unique functionalities are at a 3′ end or a 5′ end of the concatemers.

While this invention has been disclosed with reference to specific aspects and embodiments, it is apparent that other embodiments and variations of this invention may be devised by others skilled in the art without departing from the true spirit and scope of the invention. Furthermore, methodologies and examples provided in U.S. patent application Ser. Nos. 11/451,691, filed Jun. 13, 2006; 11/451,692, filed Jun. 13, 2006 and 11/679,124, filed Feb. 26, 2007 are hereby incorporated by reference in their entirety for use with methods and compositions herein disclosed. 

1. A method for forming a polynucleotide comprising a deletion mate pair, the method comprising: (a) providing a first linear construct, wherein the linear construct comprises a first adaptor interposed between a first target polynucleotide fragment and a second target polynucleotide fragment, and wherein the first target polynucleotide fragment and the second target polynucleotide fragment are contiguous nucleic acids within a target polynucleotide; (b) ligating a deletion adaptor to the first linear construct to form a second linear construct, wherein the deletion adaptor comprises a recognition site for a restriction endonuclease, and wherein the restriction endonuclease cleaves at a known distance from said recognition site; (c) cleaving the second linear construct with the restriction endonuclease to form a third linear construct, thereby forming the polynucleotide comprising a deletion mate pair.
 2. The method of claim 1, further comprising: (d) providing a second adaptor to the cleaved second linear construct; and (e) ligating the adaptor to one end of the cleaved second linear construct to form a fourth linear construct.
 3. The method of claim 2, further comprising: (f) providing a third adaptor to the end of the fourth linear construct opposite the end with the ligated second adaptor; and (g) ligating the third adaptor to the fourth linear construct.
 4. The method of claim 3, further comprising: (h) circularizing the fourth linear construct.
 5. The method of claim 1, wherein the cleaving step (c) comprises deleting a known number of bases from the second target polynucleotide fragment.
 6. The method of claim 1, wherein at least one adaptor comprises two recognition sites for the restriction endonuclease on each end.
 7. A method for forming a polynucleotide comprising a deletion mate pair, the method comprising: (a) providing a first circular construct, wherein the construct comprises a first adaptor and a target polynucleotide, wherein the first adaptor comprises a recognition site for a first restriction endonuclease that cleaves at a known distance from the recognition site and a recognition site for a second restriction endonuclease that cleaves within the first adaptor; (b) cleaving the first circular construct with the first restriction endonuclease to form a first linear construct, (c) cleaving the first linear construct with the second restriction endonuclease to form a second linear construct; and (d) circularizing the second linear construct to create a second circular construct; thereby forming the polynucleotide comprising a deletion mate pair.
 8. The method of claim 7, further comprising repeating steps (b) through (d) on the second circular construct, thereby forming a third circular construct comprising a deletion mate pair.
 9. The method of claim 8, wherein steps (b) through (d) are repeated at least three times to create a series of constructs.
 10. The method of claim 8, wherein different second restriction endonucleases are used in cleaving step (c).
 11. The method of claim 8, wherein the same second restriction endonuclease is used to cleave in step (c).
 12. A method for forming a polynucleotide comprising a deletion mate pair, the method comprising: (a) providing a first linear construct comprising a target polynucleotide and an adaptor, wherein a first adaptor is attached to one end of the target polynucleotide; (b) ligating a deletion adaptor to the end of the first linear construct opposite the first adaptor, wherein the deletion adaptor comprises a recognition site for a restriction endonuclease that cleaves at a known distance from the recognition site; and (c) cleaving the first linear construct with the restriction endonuclease to form a second linear construct, thereby forming the polynucleotide comprising a deletion mate pair.
 13. The method of claim 12, further comprising: (d) providing a second adaptor to the cleaved second linear construct; and (e) ligating the second adaptor to the end of the second linear construct that is opposite of the first adaptor.
 14. The method of claim 12, further comprising: (f) circularizing the second linear construct.
 15. The method of claim 13, wherein steps (b) and (c) are repeated on the second linear construct.
 16. The method of claim 15, wherein steps (b) and (c) are repeated three or more times to create a series of constructs.
 17. A method for forming a polynucleotide comprising a deletion mate pair, the method comprising: (a) providing a first linear construct comprising a target polynucleotide; (b) ligating a deletion adaptor to one end of the linear construct, wherein the deletion adaptor comprises a recognition site for a restriction endonuclease that cleaves at a known distance from the recognition site; and (c) cleaving the first linear construct with the restriction endonuclease to form a second linear construct, thereby forming the polynucleotide comprising a deletion mate pair.
 18. The method of claim 17, further comprising (d) providing a first adaptor; and (e) ligating the first adaptor to the second linear construct to create a third linear construct.
 19. The method of claim 18, further comprising circularizing the third linear construct.
 20. The method of claim 18, further comprising: (d) providing a second adaptor; and (e) ligating the second adaptor to the second linear construct on the end opposite of the first adaptor.
 21. The method of claim 17, further comprising repeating steps (b) and (c) on the second linear construct, thereby forming a third linear construct with a deletion mate pair.
 22. The method of claim 21, wherein steps (b) and (c) are repeated at least three times to create a series of constructs.
 23. A method for forming a polynucleotide comprising a deletion mate pair, the method comprising: (a) providing a first circular construct, wherein the construct comprises a first adaptor and a target polynucleotide, wherein the first adaptor comprises a recognition site for a first restriction endonuclease that cleaves at a known distance from the recognition site; (b) cleaving the first circular construct with the first restriction endonuclease to form a first linear construct, (c) providing a second adaptor, wherein the second adaptor comprises a recognition site for a second restriction endonuclease that cleaves at a known distance from the recognition site; (d) ligating the second adaptor to one end of the first linear construct to create a second linear construct; and (e) circularizing the second linear construct to form a second circular construct; thereby forming the polynucleotide comprising a deletion mate pair.
 24. The method of claim 23, further comprising repeating steps (b) through (e) on the second circular construct, thereby forming a third circular construct with a deletion mate pair.
 25. The method of claim 24, wherein steps (b) through (e) are repeated at least three times to create a series of constructs.
 26. A method for analyzing a polynucleotide sequence, the method comprising: (a) providing a deletion mate pair construct, wherein the deletion mate pair construct comprises: i) a first adaptor, ii) a second adaptor, iii) a first target sequence, and iv) a second target sequence, wherein the first target sequence and the second target sequence span a portion of the polynucleotide sequence; (b) identifying at least one nucleotide of the first target sequence and at least one nucleotide of the second target sequence, thereby analyzing the polynucleotide sequence.
 27. The method of claim 26, wherein the at least one nucleotide of the first target sequence and the at least one nucleotide of the second target sequence are separated by a known distance in the polynucleotide sequence.
 28. The method of claim 26, wherein one or more of the first adaptor and the second adaptor comprises an anchor probe hybridization site, and the identifying step (b) comprises: i) hybridizing an anchor probe to the anchor probe hybridization site; ii) hybridizing sequencing probes to the first target sequence and the second target sequence; iii) ligating adjacent hybridized sequencing and anchor probes to form ligated probes; and iv) detecting the ligated probes to identify nucleotides of one or more of the first target sequence and the second target sequence.
 29. The method of claim 26, wherein one or more of the first adaptor and the second adaptor comprises an anchor probe hybridization site, and the identifying step (b) comprises: i) hybridizing an anchor probe to the anchor probe hybridization site; ii) adding a polymerase and at least one dNTP comprising a label under conditions wherein if the dNTP is perfectly complementary to a detection position, the dNTP is added to the anchor probe to form an extended probe, thereby creating an interrogation position of the extended probe; and iii) determining the nucleotide at the interrogation position of the extended probe.
 30. The method of claim 26, wherein the identifying step (b) comprises: (a) contacting the deletion mate pair construct with a set of sequencing probes, wherein each of the sequencing probes comprises: i) a first domain complementary to the first adaptor or to the second adaptor; ii) a unique nucleotide at a first interrogation position; and iii) a label; under conditions wherein if the unique nucleotide is complementary to a first nucleotide in the first target sequence or the second target sequence, a sequencing probe hybridizes to the deletion mate pair construct; and (b) detecting the label of the hybridized sequencing probe, thereby identifying a nucleotide in the first target sequence or the second target sequence.
 31. A method for forming a library of a plurality of circularized deletion mate pair constructs, the method comprising: (a) ligating a deletion adaptor to each of a plurality of first linear constructs, wherein the deletion adaptor comprises a recognition site for a restriction endonuclease, and wherein the restriction endonuclease cleaves at a known distance from the recognition site; (b) cleaving at least a portion of the plurality of first linear constructs with the restriction endonuclease to provide a plurality of second linear constructs; and (c) circularizing a plurality of the second linear constructs, thereby forming the library of circularized deletion mate pair constructs.
 32. The method of claim 31, wherein each of the plurality of first linear constructs comprises a circularization adaptor, wherein the circularization adaptor comprises a recognition site for a restriction endonuclease.
 33. The method of claim 32, wherein the plurality of first linear constructs is formed by a method comprising: (a) providing a plurality of circularized polynucleotides, wherein each of the circularized polynucleotides comprises a circularization adaptor, and wherein the circularization adaptor comprises a recognition site for a restriction endonuclease; and (b) cleaving the plurality of circularized polynucleotides with the restriction endonuclease, thereby forming the plurality of first linear constructs.
 34. The method of claim 32 or 33, wherein the recognition site in the circularization adaptor is identical to the recognition site in the deletion adaptor.
 35. The method of claim 32 or 33, wherein the recognition site in the circularization adaptor is different from the recognition site in the deletion adaptor.
 36. The method of claim 31, wherein the known distance in step (a) is at least 20 bases.
 37. The method of claim 31, wherein step (b) is performed without purifying the product of step (a), and step (c) is performed without purifying the second linear constructs formed in step (b).
 38. A method for forming a random array, the method comprising: (a) providing a support comprising a surface; (b) providing a plurality of deletion mate pair constructs; and (c) immobilizing the plurality of deletion mate pair constructs, thereby forming the random array.
 39. The method of claim 38, wherein: (i) the surface comprises capture probes; (ii) each of the plurality of deletion mate pair constructs comprises at least one adaptor, wherein the at least one adaptor comprises a sequence complementary to at least one capture probe; and (iii) the immobilizing step comprises forming a duplex between the adaptor and the complementary capture probe.
 40. The method of claim 38, wherein: (i) the surface comprises reactive functionalities; (ii) each of the plurality of deletion mate pair constructs comprises at least one functionality complementary to at least one reactive functionality; and (iii) the immobilizing step comprises forming one or more linkages between at least one reactive functionality and a functionality complementary to that reactive functionality.
 41. The method of claim 38, wherein the immobilizing step occurs through non-covalent interactions between the plurality of deletion mate pair constructs and the surface.
 42. The method of claim 38, wherein the plurality of deletion mate pair constructs is formed by: (a) providing a plurality of circularized polynucleotides, wherein each of the circularized polynucleotides comprises a first adaptor, and wherein the first adaptor comprises a first recognition site for a first restriction endonuclease; (b) cleaving the plurality of circularized polynucleotides with the first restriction endonuclease to provide a plurality of first linear constructs; (c) ligating a second adaptor to the first linear constructs, wherein the second adaptor comprises a recognition site for a second restriction endonuclease, and wherein the second restriction endonuclease cleaves at a known distance from the recognition site; and (d) cleaving the plurality of first linear constructs with the second restriction endonuclease to provide a plurality of second linear constructs, thereby forming the plurality of deletion mate pair constructs.
 43. An array made according to the method of claim 38.
 44. A library comprising a plurality of deletion mate pair constructs, wherein the plurality of deletion mate pair constructs comprises target sequences, and wherein the target sequences together represent at least about 80% of a genome.
 45. The library of claim 44, wherein each of the deletion mate pair constructs further comprises an adaptor, and the adaptor comprises a recognition site for a restriction endonuclease.
 46. The library of claim 45, wherein the adaptor further comprises a recognition site for a sequencing primer.
 47. The library of claim 45, wherein the adaptor is interposed between a first target sequence and a second target sequence.
 48. The library of claim 47, wherein the first target sequence and the second target sequence are of the same length.
 49. The library of claim 47, wherein the first target sequence and the second target sequence are of different lengths.
 50. The library of claim 47, wherein the first target sequence and the second target sequence each have a length of about 3 nucleotides to about 100 nucleotides.
 51. The library of claim 47, wherein the first target sequence and the second target sequence each have a length of about 8 nucleotides to about 24 nucleotides.
 52. A library comprising a plurality of circularized deletion mate pair constructs, wherein: (a) the plurality of deletion mate pair constructs comprises target sequences, and wherein the target sequences represent at least about 80% of a genome; (b) each of the plurality of circularized deletion mate pair constructs comprises a first adaptor, a first target sequence, and a second target sequence; and (c) the first target sequence and the second target sequence are separated by a known number of bases within the genome.
 53. A substrate comprising a plurality of immobilized concatemers, wherein: (a) each unit of the concatemer comprises a deletion mate pair construct, wherein the deletion mate pair construct comprises a first target sequence and a second target sequence; (b) the first target sequence and the second target sequence are derived from a target polynucleotide; and (c) the first target sequence and the second target sequence are separated by a known distance within the target polynucleotide.
 54. The substrate of claim 53, wherein the plurality of immobilized concatemers represents at least about 80% of a genome.
 55. An amplicon made by amplification of a circular library construct comprising target nucleic acid interspersed with a plurality of adaptors, wherein at least two sets of the adaptors are positioned on either side of a target polynucleotide of known length.
 56. The amplicon of claim 55, wherein the target polynucleotides between the two sets of adaptors are of the same known length.
 57. The amplicon of claim 55, wherein one or more of the adaptors comprises a restriction endonuclease recognition site.
 58. The amplicon of claim 57, wherein the restriction endonuclease recognition site is a Type IIs restriction endonuclease recognition site.
 59. The amplicon of claim 55, wherein each adaptor of the plurality of adaptors further comprises a different anchor primer binding site at a 5′ and 3′ end of each of the plurality of adaptors.
 60. A plurality of amplicons of circular library constructs, wherein each amplicon comprises target nucleic acid interspersed with a plurality of adaptors, wherein at least two sets of the adaptors are positioned on either side of a target polynucleotide of known length.
 61. The plurality of amplicons of claim 60, wherein each of the plurality of adaptors in each amplicon has been selected to have a desired orientation with respect to at least one other of the plurality of adaptors.
 62. The plurality of amplicons of claim 60, wherein the target nucleic acid is genomic DNA, cDNA or RNA, and wherein the plurality of amplicons comprises substantially all of genomic DNA, cDNA or RNA of interest.
 63. The plurality of amplicons of claim 60, wherein one or more of the adaptors comprises a restriction endonuclease recognition site.
 64. The plurality of amplicons of claim 63, wherein the restriction endonuclease recognition site is a Type IIs restriction endonuclease recognition site.
 65. The plurality of amplicons of claim 63, wherein each adaptor of the plurality of adaptors further comprises a different anchor primer binding site at a 5′ and 3′ end of each of the plurality of adaptors.
 66. A kit for selecting for desired orientations of multiple adaptors in library constructs, wherein said kit comprises: (a) a first double-stranded adaptor, wherein the first double-stranded adaptor comprises a recognition site for a first Type IIs restriction endonuclease; (b) a second double-stranded adaptor, wherein the second double-stranded adaptor comprises a restriction site for a second Type IIs restriction endonuclease; and (c) primers complementary to both ends of each of said first and second adaptors. 
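
By way of illustration only, and without limiting any claim, the ligate, cleave and circularize sequence recited in claim 31 may be modeled in silico roughly as follows. This is a minimal sketch under simplifying assumptions that are not taken from the specification: sequences are treated as single-stranded strings, the cut is blunt, the deletion adaptor is joined to the left end of the construct, and the enzyme, its recognition site ("ACGTGA"), its reach of 20 bases, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEnzyme:
    """Hypothetical endonuclease that cuts a fixed distance past its site."""
    site: str   # recognition sequence (invented for this sketch)
    reach: int  # bases between the end of the site and the cut position


def ligate_deletion_adaptor(adaptor: str, construct: str) -> str:
    """Step (a): join the deletion adaptor to the left end of the construct."""
    return adaptor + construct


def cleave(ligated: str, enzyme: ToyEnzyme) -> str:
    """Step (b): cut `reach` bases past the site and keep the right-hand piece."""
    start = ligated.find(enzyme.site)
    if start < 0:
        raise ValueError("recognition site not found")
    cut = start + len(enzyme.site) + enzyme.reach
    if cut > len(ligated):
        raise ValueError("cut position lies beyond the end of the construct")
    return ligated[cut:]


def deleted_length(adaptor: str, enzyme: ToyEnzyme) -> int:
    """Number of construct bases removed together with the adaptor."""
    site_end = adaptor.find(enzyme.site) + len(enzyme.site)
    return enzyme.reach - (len(adaptor) - site_end)


def circularize(linear: str) -> dict:
    """Step (c): record the construct as circular (a toy representation)."""
    return {"topology": "circular", "sequence": linear}


if __name__ == "__main__":
    enzyme = ToyEnzyme(site="ACGTGA", reach=20)
    adaptor = "CC" + enzyme.site + "TT"          # site followed by 2 adaptor bases
    construct = "A" * 18 + "GATTACAGATTACA"      # toy first linear construct
    second = cleave(ligate_deletion_adaptor(adaptor, construct), enzyme)
    print(deleted_length(adaptor, enzyme), circularize(second))
```

In this toy model the number of bases removed from the construct depends only on the enzyme's reach and on the position of the recognition site within the adaptor, which is why the sequence adjoining the deletion lies at a known offset from the original end of the construct.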